Build Parallel Corpus Custom Model

Overview

Most of the provided translation models in Language Translator can be extended to learn custom terms and phrases or a general style that's derived from your translation data.

The Language Translator service supports two types of customization requests. You can customize a model either with a forced glossary or with a corpus that contains parallel sentences.

Use a forced glossary to force certain terms and phrases to be translated in a specific way.
Use a parallel corpus when you want your custom model to learn from general translation patterns in your samples. What your model learns from a parallel corpus can improve translation results for input text that the model hasn't been trained on.

The general improvements from parallel corpus customization are less predictable than the mandatory results you get from forced glossary customization.

Use a parallel corpus to provide additional translations for the base model to learn from. This helps to adapt the base model to a specific domain. How the resulting custom model translates text depends on the model's combined understanding of the parallel corpus and the base model.

Training data format: TMX (UTF-8 encoded)
Maximum length of translation pairs: 80 source words
Minimum number of translation pairs: 5,000
Maximum file size: 250 MB
You can submit multiple parallel corpus files by repeating the parallel_corpus multipart form parameter as long as the cumulative size of the files doesn't exceed 250 MB.
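
The size and pair-count limits above can be checked locally before uploading. The following is a minimal sketch, assuming a POSIX shell. The function name `check_corpus` is illustrative, 1 MB is assumed to mean 1024 × 1024 bytes, and translation pairs are counted by matching opening `<tu>` tags as plain text rather than by parsing the XML. It does not check the 80-word limit per pair.

```shell
# Pre-flight check for a parallel corpus TMX file (a sketch, not an official validator).
# Limits come from the list above: 250 MB maximum size, 5,000 translation pairs minimum.
check_corpus() {
    file="$1"
    max_bytes=$((250 * 1024 * 1024))   # assumption: 1 MB = 1024 * 1024 bytes
    min_pairs=5000

    # File size in bytes (tr strips the padding that some wc implementations add).
    size=$(wc -c < "$file" | tr -d ' ')

    # Count opening <tu> tags; the pattern '<tu[ >]' does not match <tuv>.
    pairs=$({ grep -o '<tu[ >]' "$file" || true; } | wc -l | tr -d ' ')

    echo "size_bytes=$size pairs=$pairs"
    if [ "$size" -gt "$max_bytes" ]; then
        echo "FAIL: file is larger than 250 MB"
        return 1
    fi
    if [ "$pairs" -lt "$min_pairs" ]; then
        echo "FAIL: fewer than $min_pairs translation pairs"
        return 1
    fi
    echo "OK"
}

# Usage:
# check_corpus en-fr-6000-ParallelCorpus.tmx
```

When several files are uploaded in one request, the limits apply to the files combined, so the per-file numbers printed here would need to be added up by hand.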

Note: Make sure that your Language Translator service instance is on an Advanced or Premium pricing plan. The Lite and Standard plans do not support customization.

Objective

IBM Watson™ Language Translator allows you to translate text programmatically from one language into another language.

In this section, you build a custom model that is trained on a parallel corpus.

Tools Used

Watson Language Translator

Requirements

IBM Cloud Account

Creating a TMX file

To provide a glossary or corpus of terms for training the Language Translator service, create a valid UTF-8 encoded document that conforms to the Translation Memory Exchange (TMX) version 1.4 specification. TMX is an XML specification that is designed for machine-translation tools.

Each term and translation pair must be enclosed in <tu> tags:

```
<tu>
    <tuv xml:lang="en">
        <seg>patent</seg>
    </tuv>
    <tuv xml:lang="fr">
        <seg>brevet</seg>
    </tuv>
</tu>
```

The xml:lang attribute in the <tuv> tag identifies the language in which a term is expressed, while the <seg> tag contains the term or the translation.

The Language Translator service uses only the <tu>, <tuv>, and <seg> elements. All other elements in the example are required to make a valid TMX file, or are informational, but are not used by the service.
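
If your sentence pairs are in another format, a TMX file with this structure can be generated by a short script. The following is a minimal sketch, assuming a POSIX shell with `awk` and an input of one `source<TAB>target` pair per line. The function name `tsv_to_tmx`, the tab-separated input format, and the header attribute values are illustrative assumptions, not something that the service prescribes.

```shell
# Convert tab-separated "source<TAB>target" lines on stdin into a minimal TMX 1.4 document.
# Usage: tsv_to_tmx SOURCE_LANG TARGET_LANG < pairs.tsv > corpus.tmx
# The header values below are placeholders; the service reads only <tu>, <tuv>, and <seg>.
tsv_to_tmx() {
    src_lang="$1"
    tgt_lang="$2"
    printf '<?xml version="1.0" encoding="UTF-8" ?>\n'
    printf '<tmx version="1.4">\n'
    printf '  <header creationtool="tsv_to_tmx" creationtoolversion="0.1" segtype="sentence" o-tmf="unknown" adminlang="en" srclang="%s" datatype="PlainText" />\n' "$src_lang"
    printf '  <body>\n'
    awk -F '\t' -v s="$src_lang" -v t="$tgt_lang" '
        # Escape the characters that are special in XML text content.
        function esc(x) { gsub(/&/, "\\&amp;", x); gsub(/</, "\\&lt;", x); gsub(/>/, "\\&gt;", x); return x }
        # Lines that do not have exactly two tab-separated fields are skipped.
        NF == 2 {
            printf "    <tu>\n"
            printf "      <tuv xml:lang=\"%s\"><seg>%s</seg></tuv>\n", s, esc($1)
            printf "      <tuv xml:lang=\"%s\"><seg>%s</seg></tuv>\n", t, esc($2)
            printf "    </tu>\n"
        }'
    printf '  </body>\n'
    printf '</tmx>\n'
}

# Usage:
# printf 'patent\tbrevet\n' | tsv_to_tmx en fr > glossary.tmx
```

The input is assumed to be UTF-8 already; the script passes bytes through unchanged and does not convert encodings.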

Sample TMX file

The following is a small piece of an English to French parallel corpus that was downloaded from the MultiUN collection available on the OPUS open parallel corpus website. You can download a compressed version of the entire TMX file.

```
<?xml version="1.0" encoding="UTF-8" ?>
<tmx version="1.4">
<header creationdate="Tue Jan 29 12:49:40 2013"
          srclang="en"
          adminlang="en"
          o-tmf="unknown"
          segtype="sentence"
          creationtool="Uplug"
          creationtoolversion="unknown"
          datatype="PlainText" />
  <body>
    <tu>
      <tuv xml:lang="en"><seg>RESOLUTION 55/100</seg></tuv>
      <tuv xml:lang="fr"><seg>RÉSOLUTION 55/100</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Adopted at the 81st plenary meeting, on 4 December 2000, on the recommendation of the Committee (A/55/602/Add.2 and Corr.1, para. 94),The draft resolution recommended in the report was sponsored in the Committee by: Bolivia, Cuba, El Salvador, Ghana and Honduras. by a recorded vote of 106 to 1, with 67 abstentions, as follows:</seg></tuv>
      <tuv xml:lang="fr"><seg>Adoptée à la 81e séance plénière, le 4 décembre 2000, sur la recommandation de la Commission (A/55/602/Add.2, par. 94)Le projet de résolution recommandé dans le rapport de la Commission avait pour auteurs les pays suivants: Bolivie, Cuba, El Salvador, Ghana et Honduras., par 106 voix contre une, avec 67 abstentions, à la suite d'un vote enregistré, les voix s'étant réparties comme suit:</seg></tuv>
    </tu>
    ...
```

Steps

To build a parallel corpus custom model:

Go to the terminal window that you configured in the previous section.
The terminal window should already be set up for making API calls. If it is not, run the following commands:
```
export apikey=<your API key>
export url=<your url>
```

To check whether a specific model, for example en-fr, supports customization, run the following command:

```
curl --user "apikey:$apikey" "$url/v3/models?source=en&target=fr&version=2018-05-01"
```

The command returns the following JSON data. The "customizable" : true entry shows that the model supports customization.

```
{
    "models" : [ {
        "model_id" : "en-fr",
        "source" : "en",
        "target" : "fr",
        "base_model_id" : "",
        "domain" : "general",
        "customizable" : true,
        "default_model" : true,
        "owner" : "",
        "status" : "available",
        "name" : "en-fr",
        "training_log" : null
    } ]
}
```
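
The check for the "customizable" entry can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name `is_customizable` is illustrative, and it matches the field as plain text instead of parsing the JSON, so it is only meaningful for a response that describes a single model, as above.

```shell
# Succeed (exit status 0) if the JSON on stdin contains "customizable" : true.
# Plain-text match, not a JSON parser; assumes the response describes one model.
is_customizable() {
    grep -Eq '"customizable"[[:space:]]*:[[:space:]]*true'
}

# Usage (assumes $apikey and $url are exported as in the first step):
# curl --silent --user "apikey:$apikey" \
#     "$url/v3/models?source=en&target=fr&version=2018-05-01" \
#     | is_customizable && echo "en-fr supports customization"
```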

Optionally, run the following command to list all models and check which ones support customization.
```
curl --user "apikey:$apikey" "$url/v3/models?version=2018-05-01"
```
Create your training data.

For this exercise, a TMX file en-fr-6000-ParallelCorpus.tmx is provided in the repo.
Train your custom model.

Use the Create model method to train your model. In your request, specify the model ID of a customizable base model in the base_model_id parameter and the training data in the parallel_corpus parameter.
```
curl -X POST --user "apikey:$apikey" --form parallel_corpus=@en-fr-6000-ParallelCorpus.tmx "$url/v3/models?version=2018-05-01&base_model_id=en-fr&name=custom-english-to-french"
```
You can upload multiple parallel_corpus files in one request. Across all uploaded files combined, your parallel corpus must contain at least 5,000 parallel sentences to train successfully.
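
Repeating the parallel_corpus form parameter can be scripted when the corpus is split across several files. The following is a minimal sketch, assuming a POSIX shell and that `$apikey` and `$url` are exported as in the first step. The function name `train_with_corpora` and its argument order are illustrative, and the base model is fixed to en-fr as in the command above.

```shell
# Train a custom model from one or more parallel corpus files.
# Usage: train_with_corpora MODEL_NAME FILE.tmx [FILE.tmx ...]
train_with_corpora() {
    name="$1"
    shift
    count=$#
    # Append one "--form parallel_corpus=@FILE" pair per file,
    # then drop the bare file names from the front of the argument list.
    for f in "$@"; do
        set -- "$@" --form "parallel_corpus=@$f"
    done
    shift "$count"
    curl -X POST --user "apikey:$apikey" "$@" \
        "$url/v3/models?version=2018-05-01&base_model_id=en-fr&name=$name"
}

# Usage (part1.tmx and part2.tmx are hypothetical file names):
# train_with_corpora custom-english-to-french part1.tmx part2.tmx
```

The model name is placed in the query string unchanged, so this sketch assumes a name without spaces or other characters that would need URL encoding.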

The command returns

```
{
    "model_id" : "43745eda-7fde-4998-a62a-26cf0e795973",
    "source" : "en",
    "target" : "fr",
    "base_model_id" : "en-fr",
    "domain" : "general",
    "customizable" : true,
    "default_model" : false,
    "owner" : "1e5f399b-605d-4e83-b07e-534da85b86a9",
    "status" : "dispatching",
    "name" : "custom-english-to-french",
    "training_log" : null
}
```

The API response contains details about your custom model, including its model ID.

Record the model_id.
```
export MODELID=<model_id>
```
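
Instead of copying the ID by hand, it can be pulled out of the Create model response. The following is a minimal sketch, assuming a POSIX shell with `sed`; the function name `extract_model_id` is illustrative, and the extraction is a plain-text match that assumes the "model_id" field appears once, as in the response shown earlier.

```shell
# Print the value of the "model_id" field from a Create model response on stdin.
# The leading quote in the pattern keeps "base_model_id" from matching.
extract_model_id() {
    sed -n 's/.*"model_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (sends a real training request, so it is left commented out):
# export MODELID=$(curl -X POST --user "apikey:$apikey" \
#     --form parallel_corpus=@en-fr-6000-ParallelCorpus.tmx \
#     "$url/v3/models?version=2018-05-01&base_model_id=en-fr&name=custom-english-to-french" \
#     | extract_model_id)
```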
Check the status of the new custom model.

Model training might take anywhere from a couple of minutes (for forced glossaries) to several hours (for large parallel corpora) depending on how much training data is involved. To see if your model is available, use the Get model details method and specify the model ID that you received in the service response of the previous step. Also, you can check the status of all of your models with the List models method.

The following command gets information for the model identified by the model ID $MODELID.
```
curl --user "apikey:$apikey" "$url/v3/models/$MODELID?version=2018-05-01"
```
It returns
```
{
    "model_id" : "43745eda-7fde-4998-a62a-26cf0e795973",
    "source" : "en",
    "target" : "fr",
    "base_model_id" : "en-fr",
    "domain" : "general",
    "customizable" : true,
    "default_model" : false,
    "owner" : "1e5f399b-605d-4e83-b07e-534da85b86a9",
    "status" : "training",
    "name" : "custom-english-to-french",
    "training_log" : null
}
```
The status response attribute describes the state of the model in the training process:

- uploading
- uploaded
- dispatching
- queued
- training
- trained
- publishing
- available

When the model status is available, your model is ready to use with your service instance.

```
{
    "model_id" : "43745eda-7fde-4998-a62a-26cf0e795973",
    "source" : "en",
    "target" : "fr",
    "base_model_id" : "en-fr",
    "domain" : "general",
    "customizable" : true,
    "default_model" : false,
    "owner" : "1e5f399b-605d-4e83-b07e-534da85b86a9",
    "status" : "available",
    "name" : "custom-english-to-french",
    "training_log" : null
}
```
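
Because training can take hours, it may be convenient to poll the Get model details method until the status is available. The following is a minimal sketch, assuming a POSIX shell and that `$apikey`, `$url`, and `$MODELID` are exported as in the earlier steps. The function names `model_status` and `wait_for_model` are illustrative, the 60-second default interval is an arbitrary choice, and the status is extracted as plain text rather than with a JSON parser.

```shell
# Print the value of the "status" field from a model JSON document on stdin.
model_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll the model until its status is "available".
# Usage: wait_for_model [SECONDS_BETWEEN_CHECKS]
wait_for_model() {
    interval="${1:-60}"   # arbitrary default, not a service recommendation
    while :; do
        status=$(curl --silent --user "apikey:$apikey" \
            "$url/v3/models/$MODELID?version=2018-05-01" | model_status)
        # Stop if the response has no status field (for example, an error response).
        if [ -z "$status" ]; then
            echo "no status field in response" >&2
            return 1
        fi
        echo "status: $status"
        if [ "$status" = "available" ]; then
            return 0
        fi
        sleep "$interval"
    done
}

# Usage:
# wait_for_model 120
```

The loop has no upper time limit; it ends only when the model becomes available or a response arrives without a status field, so interrupt it manually if training appears stuck.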

Translate text with your new custom model.

To use your custom model, specify the text that you want to translate and the custom model's model ID in the Translate method.

The following command translates text with the custom model identified by the model ID $MODELID.
```
curl -X POST --user "apikey:$apikey" --header "Content-Type: application/json" --data "{\"text\":\"Hello, Lee Zhang. Please don't park in the alley.\",\"model_id\":\"$MODELID\"}" "$url/v3/translate?version=2018-05-01"
```
It returns
```
{
    "translations" : [ {
        "translation" : "Bonjour, Lee Zhang. Veuillez ne pas vous garer dans l'allée."
    } ],
    "word_count" : 13,
    "character_count" : 49
}
```
Neither Lee Zhang nor alley was translated in the way that is defined in the TMX file en-fr-6000-ParallelCorpus.tmx. A model that is customized with a parallel corpus learns general translation patterns from the samples in the TMX file. What it learns can improve translation results for input text that the model hasn't been trained on, but these general improvements are less predictable than the mandatory results you get from forced glossary customization.

You can apply a forced glossary to a model that has been customized with a parallel corpus.

```
curl -X POST --user "apikey:$apikey" --form forced_glossary=@en-fr-ForcedGlossary.tmx "$url/v3/models?version=2018-05-01&base_model_id=$MODELID&name=custom-english-to-french-2"
```