Throughout this post we will see how to create an English–Spanish translator using neural networks. For this, we will use Trax, Google's new library based on its AI framework TensorFlow.
I would like to warn readers that, in trying to explain things simply, I may slightly alter the exact meaning of some terms. Those with deep knowledge of the topic, please forgive me. On the other hand, some knowledge of a few basic concepts is required to follow and really understand this post.
What we will see
Some Basic Concepts
What is a generator?
A generator is a Python function that works as an iterator. Basically, it can generate a result, but you can iterate over this function, as it uses a yield instead of a return.
In this case we will use generators to feed data from our dataset to the training. We just need to bear in mind that these are functions we can iterate over; here, they stream through our data as a sequence.
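As a toy illustration (the sentence pairs below are made up), a minimal generator could look like this:

def sentence_stream(dataset):
    # Yields one (en, es) pair at a time instead of returning them all at once.
    for pair in dataset:
        yield pair

pairs = [('I like eating', 'Me gusta comer'),
         ('The dog is big', 'El perro es grande')]
stream = sentence_stream(pairs)
print(next(stream))  # ('I like eating', 'Me gusta comer')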
What is a token? What does it mean if it is a subword type?
In the NLP world, text is divided into tokens, the smaller units that words can be broken into.
The most typical ones are word, char and subword.
An example:
- I like eating -> ‘I’ ‘like’ ‘eating’
- I like eating -> ‘I’ ‘l’ ‘i’ ‘k’ ‘e’ ‘e’ ‘a’ ‘t’ ‘i’ ‘n’ ‘g’
- I like eating -> ‘I’ ‘like’ ‘eat’ ‘ing’
NLP uses these three types, but we will use subword for several reasons:
- While working at a subword level, we do not lose meaning, but at the same time the vocabulary stays limited. For example, ‘eat’ would be a token that carries meaning on its own, whereas with word-level tokens we would need separate entries for eating, eats, ate.
- On the contrary, characters would make the vocabulary smaller, but they have no meaning on their own, which makes it very difficult to establish relations between words.
Another very important term mentioned here is vocabulary, as we will need a vocabulary containing the most relevant tokens.
What does bucketing mean?
Models based on sentences and text (NLP) have high memory requirements. Given the large amount of data and the variety of sentence sizes, this technique helps alleviate the problem.
To understand the bucketing technique, we must first know some basic concepts: batches, the shape of our data, and, in this case, padding.
What is a batch?
When training a model, you send data to it, and there are two alternatives: you can send data elements one by one, or you can send them in sets (batches). For example, a batch size of 32 means sending 32 data elements from our dataset at a time.
Before explaining bucketing, let's look at the data we will use to create a translator:
In order to train a Deep Learning translation model we must follow the same pattern as with any other supervised learning model. This means that our dataset must contain both the questions and the answers to those questions; in this case, each English sentence and its translation into Spanish.
Let's use a dataset from the TensorFlow catalog, with 21,987,267 sentence pairs (en, es).
We will find sentences such as:
- The dog is big.
- The minimum interprofessional wage in Belgium has been retracted due to the change of strategy in the country’s international relations.
On the other hand, padding consists of adding a special token so that all sentences have the same size. This way our model can train with homogeneous data.
To understand this, if the shortest sentence in my dataset is:
Hi my name is Iago -> 5 words
And if the longest sentence of this dataset is:
Tomorrow it will rain cats and dogs but I do not care because today I have been given a raincoat that perfectly protects me from rain and humidity -> 28 words
In this case, in order to use the first sentence we would need to add 23 padding tokens, ending up with something like this:
Hi my name is Iago <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
With bucketing we can create buckets of different sizes and classify sentences into them by size. This way we won't waste so much memory adding <PAD> to each sentence. So if I have a bucket of 8 tokens, the first sentence would turn into this:
Hi my name is Iago <PAD> <PAD> <PAD>
We can save a considerable amount of memory this way when working with so much data.
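To make the idea concrete, here is a toy sketch of bucketing (the boundaries are made up; later on, Trax will do this for us with data.BucketByLength):

boundaries = [8, 16, 32]

def bucket_and_pad(tokens, pad='<PAD>'):
    # Pad the sentence only up to the boundary of the smallest bucket it fits in.
    for boundary in boundaries:
        if len(tokens) <= boundary:
            return tokens + [pad] * (boundary - len(tokens))
    raise ValueError('Sentence longer than the largest bucket')

print(bucket_and_pad(['Hi', 'my', 'name', 'is', 'Iago']))
# ['Hi', 'my', 'name', 'is', 'Iago', '<PAD>', '<PAD>', '<PAD>']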
Something else to take into account: turning tokens into numbers
Even though this is a lot of information, and bearing in mind that books could be written about each concept, it is important to describe this process because our machines only understand numbers, not words or tokens. For this reason, before sending the data to training, we must turn these tokens into numbers using our vocabulary.
So, if our vocabulary is: a (1), my (2), car (3), rolling (4), smile (5), junk (6), is (7), of (8), unknown_word (9).
The sentence: my car is a rolling piece of junk
Would become: 2 3 7 1 4 9 8 6
And this process, though in a more complete and complex way, is repeated for every piece of data used in training and then in inference (the process of asking the model to predict, in this case, the translation).
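As a minimal sketch of this lookup, using the toy vocabulary above:

# Toy vocabulary from the example above; 9 is the id for unknown words.
vocab = {'a': 1, 'my': 2, 'car': 3, 'rolling': 4, 'smile': 5,
         'junk': 6, 'is': 7, 'of': 8}
UNK = 9

sentence = 'my car is a rolling piece of junk'
ids = [vocab.get(token, UNK) for token in sentence.split()]
print(ids)  # [2, 3, 7, 1, 4, 9, 8, 6]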
Transformers
Transformers are a type of neural network architecture used in language-based (NLP) models.
A transformer is, in the case of a translation system, something that turns a sentence written in one language into another language. To explain this, I will use this great post by Jay Alammar.
He explains it visually, starting with an input and an output (the desired one) and treating the transformer as a black box that we will decode little by little. We will keep the explanation brief so we can get all hands on deck as soon as possible.
If we looked inside this black box, we would see an encoding component and a decoding component:
Both components are stacks of encoders and decoders (the same amount in both cases):
Encoders always have the same structure, as do decoders. Encoders are divided into two different layers:
- Self-attention layer: helps the encoder focus attention on the relevant words. More info.
- Feed-forward neural network: a simpler neural network architecture.
Decoders are divided into three layers:
- Self-attention layer
- Encoder-decoder attention: helps the decoder focus on the relevant parts of the input sentence.
- Feed forward
From here, the tensors that are sent between the different layers would come into play. You can go into much more detail by following Jay Alammar’s post where he explains the architecture perfectly.
What is Trax?
Trax is a “high level” deep learning library that uses TensorFlow as its back-end. I would compare it to Keras.
We are going to use Trax because Google is clearly betting on this library, because it is still young (version 1.3.6 at the time of writing this post), and because it is fast and easy to use.
Working with TensorFlow as a back end has many advantages when putting models into production (where TensorFlow is the king of AI frameworks for companies) as, among other things, it allows converting the model into a Keras format.
Another advantage is being able to use TFDS (TensorFlow Datasets), as we will see in this post.
To start using Trax, we just need to install it:
pip install -q -U trax
From this moment we just need to import the library to use it:
import trax
You can also import only what you need:
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training
from trax import data
from trax import models
*Pro tip:
Install the JAX dependency with GPU support, as the CPU version is installed by default.
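The exact command depends on your CUDA version and changes between JAX releases, so check the JAX installation guide; at the time of writing it looked roughly like this (the version numbers here are placeholders):

pip install --upgrade jax jaxlib==0.1.57+cuda101 -f https://storage.googleapis.com/jax-releases/jax_releases.html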
Data
Where do we get data from?
To train a model like this one we need data and, as explained before, we will take advantage of TFDS. More specifically, we will go to the translate section and choose para_crawl, which uses ParaCrawl version 1.2 (with almost 22 million sentences in English and their corresponding Spanish ones).
We will take the data from here, but we could also take the latest version straight from the ParaCrawl website, which is at version 7.1 and has nearly 79 million English sentences with their Spanish pairs. Warning! This will download about 6 GB of data.
# Create the generator with the training data
train_stream_fn = data.TFDS('para_crawl/enes',
                            keys=('en', 'es'),
                            eval_holdout_size=0.01,  # 1% for validation
                            train=True)

# Create the generator with the validation data
eval_stream_fn = trax.data.TFDS('para_crawl/enes',
                                keys=('en', 'es'),
                                eval_holdout_size=0.01,  # 1% for validation
                                train=False)

train_stream = train_stream_fn()
eval_stream = eval_stream_fn()
How do we treat/transform data?
To work with these data we will need a subword-type vocabulary, which is what the Google engineers used for English/German.
We will need to generate it ourselves, aiming for a vocabulary of about 32,000 tokens. To do this, the first thing is to extract the sentences from the previously downloaded dataset into a text file, interleaving the English sentences with the Spanish ones so that the vocabulary covers both languages.
import sys

i = 0
with open('data/train.txt', 'w') as f:
    for text_en, text_es in train_stream:
        f.write(str(text_en, 'utf-8') + '\n')
        f.write(str(text_es, 'utf-8') + '\n')
        if i == 2000000:
            sys.exit("Done")  # The stream is infinite, so we stop it here.
        i += 1
*Pro Tip:
We put a limit of two million sentences for two reasons: first, there are 22 million and we do not need all of them to build a vocabulary of 32,000 tokens; second, the generator is infinite, and if you don't stop it, it will keep pouring out data non-stop.
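As an alternative sketch, itertools.islice stops an infinite generator more gracefully than killing the process:

from itertools import islice

# Take only the first 2,000,000 pairs from the infinite stream.
with open('data/train.txt', 'w') as f:
    for text_en, text_es in islice(train_stream, 2000000):
        f.write(str(text_en, 'utf-8') + '\n')
        f.write(str(text_es, 'utf-8') + '\n')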
The next step is to create the vocabulary using a script that was ported from Tensor2Tensor to Trax.
We clone the repository locally:
git clone https://github.com/google/trax.git
We execute the command:
python trax/data/text_encoder_build_subword.py \
    --corpus_filepattern=data/train.txt --corpus_max_lines=40000 \
    --output_filename=data/paracrawl.subword
*Pro tips:
- There are several variables you will have to play with to get 32,000–35,000 tokens. One is corpus_max_lines, which indicates how many of those two million lines you are going to use. Another is min_count: by default it is 5, and it specifies the minimum number of times a token has to appear.
- If your script fails, be sure to pass main in app.run.
Now that we have the vocabulary, we can encode the words as numbers, that is, tokenize the dataset:
# Tokenize the dataset.
tokenized_train_stream = data.Tokenize(vocab_file='paracrawl.subword',
                                       vocab_dir='data/')(train_stream)
tokenized_eval_stream = data.Tokenize(vocab_file='paracrawl.subword',
                                      vocab_dir='data/')(eval_stream)
The next step is to define the EOS (End Of Sentence) token. In this case we will use a 1, create a function that appends the EOS to each sentence in the pipeline, and then apply it.
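The helper itself is not shown here, so this is a minimal sketch of what append_eos could look like (the implementation is an assumption, with EOS = 1 as stated above):

import numpy as np

EOS = 1  # id of the End Of Sentence token

def append_eos(stream):
    # Appends the EOS token to both sentences of each (en, es) pair.
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield np.array(inputs_with_eos), np.array(targets_with_eos)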
# Add EOS to the data
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)
We must also filter the sentences by size (number of tokens), in case someone happens to have entered the entire Don Quixote as a single sentence. Let's limit the training dataset to 512 tokens per sentence and the validation dataset to 1024:
filtered_train_stream = data.FilterByLength(
    max_length=512, length_keys=[0, 1])(tokenized_train_stream)
filtered_eval_stream = data.FilterByLength(
    max_length=1024, length_keys=[0, 1])(tokenized_eval_stream)
Note: length_keys is used to indicate which keys this filtering is applied to. Here it is done on both English and Spanish (en, es).
Now it is time to divide the data into batches and buckets, creating the organized generator (each batch_sizes[i] corresponds to its bucket boundaries[i], which makes it more memory-efficient):
boundaries = [8, 16, 32, 64, 128, 256, 512, 2048]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

train_batch_stream = data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1])(filtered_train_stream)
eval_batch_stream = data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1])(filtered_eval_stream)
Each batch is padded with the <PAD> token (in this case, token 0) up to its bucket boundary. Now we mask those padding tokens so that they do not contribute to the loss:
train_batch_stream = data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = data.AddLossWeights(id_to_mask=0)(eval_batch_stream)
We will now define the Transformer model, using the same hyper-parameters that were used to train the English-German model, which can be seen on GitHub in their inference example.
model = models.Transformer(
    input_vocab_size=33300,
    d_model=512,
    d_ff=2048,
    dropout=0.1,
    n_heads=8,
    n_encoder_layers=6,
    n_decoder_layers=6,
    max_len=2048,
    mode='train')
In the field input_vocab_size (vocabulary size) you can put the number of tokens you have, or fewer, but then you must use the same value when making the prediction. These are the parameters:
- input_vocab_size – vocabulary size
- d_model – dimension at various points of the architecture, such as the output of the embedding layer
- d_ff – size of the feed-forward dense layer (in both encoder and decoder)
- dropout – probability of discarding activation values in the neural networks within the encoders/decoders
- n_heads – number of "attention heads"
- n_encoder_layers – number of encoder layers
- n_decoder_layers – number of decoder layers
- max_len – maximum sequence length for the positional encoding (PE); see the paper (section 3.5) for more information
- mode – indicates whether the model is for 'train', 'eval' or 'predict'. We will use 'predict' for inference
The next step will be to create the training and validation tasks. Here you can play with the hyper-parameters. In my case, I have used these:
train_task = training.TrainTask(
    labeled_data=train_batch_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    lr_schedule=trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=20000,
                                               max_value=0.001),
    n_steps_per_checkpoint=20000)

eval_task = training.EvalTask(
    labeled_data=eval_batch_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20)
Now we execute the tasks in a loop, with a number of steps that we will define:
training_loop = training.Loop(model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

n_steps = 500000  # training steps
training_loop.run(n_steps)
Something we could add to this loop is a callback to end the training when the model stops improving.
Results
In this post you have almost everything you need to build a very good translation model. I leave the details of inference mostly to your imagination (there is a brief sketch after the results), but here I paste the result of a test:
- Original sentence:
The good news is that puppies sleep a lot, although they do not always sleep through the night, and your pup may wake the household whining and barking to express his displeasure at being left alone
- Tokenized sentence:
[ 40 378 3061 21 30 5024 20386 25 7354 6 1352 2 2932 93 68 65 557 7354 222 5 1136 2 11 59 5024 477 233 11770 5 12232 9829 2239 11 22105 77 13 5344 101 21584 4837 2089 72 332 813 2859 1]
- Prediction:
[ 60 1299 13497 25 15 19 22015 20535 17 33267 30482 9 576 2 1066 41 435 33267 30482 9 239 7 1008 2 10 37 22015 20535 9 124 8093 45 3163 15 17800 9 10 17575 6 24 17921 37 21586 14733 29 45 132 7556 568]
- Detokenized translation:
La buena noticia es que los cachorros duermen mucho, aunque no siempre duermen durante la noche, y su cachorro puede despertar al hogar que grita y ladra para expresar su disgusto al ser dejado solo
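For those who want a starting point, here is a minimal inference sketch modeled on the translation example in the Trax README; the checkpoint filename ('model.pkl.gz') and max_length are assumptions, not values from this post:

from trax.supervised import decoding

# Rebuild the model in 'predict' mode and load the trained weights
# (the checkpoint path is an assumption; Loop saves checkpoints in output_dir).
model = models.Transformer(
    input_vocab_size=33300, d_model=512, d_ff=2048, dropout=0.1,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')
model.init_from_file('model.pkl.gz', weights_only=True)

sentence = 'The dog is big.'
tokenized = list(data.tokenize(iter([sentence]),
                               vocab_file='paracrawl.subword',
                               vocab_dir='data/'))[0][None, :]  # Add batch dim.

# Greedy decoding (temperature=0.0) until the EOS token (id 1).
tokenized_translation = decoding.autoregressive_sample(
    model, tokenized, temperature=0.0, eos_id=1, max_length=512)
tokenized_translation = tokenized_translation[0][:-1]  # Drop EOS.

translation = data.detokenize(tokenized_translation,
                              vocab_file='paracrawl.subword',
                              vocab_dir='data/')
print(translation)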