Neural Machine Translation with TRANSFORMERS


This Project on Neural Machine Translation is presented to you by

It’s part of our Deep Learning with TensorFlow 2 course

Take your time and enjoy every moment.

Table of Contents

  • Task
  • Data
  • Modeling
  • Error Sanctioning
  • Training, Validation and Optimization
  • Performance Measurement
  • Testing
  • Corrective Measure


Neural Machine Translation is the use of neural networks to automatically translate text from one language to another.


Input (English): I love Deep learning
Output (French): J’aime l’apprentissage approfondi


Recall that Machine Learning involves training a model using data. In our case, our data is made of input sentences in English and their corresponding translations in French (which are the outputs). The data we shall use can be downloaded via this link: It contains translations from English to French.

Here is what it looks like:

Our next step will be to use TensorFlow 2 to prepare this data for training our Transformer model.

+ Imports

As we go on, we shall understand the need for each of these imports
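The original import cell was lost in extraction; based on what the rest of the article uses, a plausible minimal set looks like this (the exact import list in the course may differ):

```python
import tensorflow as tf
# In TF >= 2.6, TextVectorization lives in tf.keras.layers;
# on older 2.x versions it is under tf.keras.layers.experimental.preprocessing
from tensorflow.keras.layers import TextVectorization
```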

Text data is vectorized before being passed into a Machine Learning model. In our case, we shall use word vectorization: we want each word in both the input and output to be represented as a one-hot vector.
Ex: if we have a vocabulary (the list of words our model recognizes) of words:

 Vocabulary = {<PAD>, I, am, the, happy, people, met, laughing}

And we want to pass the sentence ‘I met happy people laughing’ into our model, we get:

I becomes: [0,1,0,0,0,0,0,0]
met becomes: [0,0,0,0,0,0,1,0]
happy: [0,0,0,0,1,0,0,0]
people: [0,0,0,0,0,1,0,0]
laughing: [0,0,0,0,0,0,0,1]

Notice how each vector contains all zeros except a one (1) at the position the word occupies in the vocabulary.

While you see:  I met happy people laughing

Here is what the model sees (one column per word):

            I   met  happy people laughing
<PAD>      |0|  |0|  |0|   |0|    |0|
I          |1|  |0|  |0|   |0|    |0|
am         |0|  |0|  |0|   |0|    |0|
the        |0|  |0|  |0|   |0|    |0|
happy      |0|  |0|  |1|   |0|    |0|
people     |0|  |0|  |0|   |1|    |0|
met        |0|  |1|  |0|   |0|    |0|
laughing   |0|  |0|  |0|   |0|    |1|

A 5 x 8 matrix (5 words, vocab_size = 8).
DIMENSION: BATCH_SIZE x SENTENCE_LENGTH x VOCAB_SIZE = 1 x 5 x 8
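The one-hot matrix above can be reproduced in a couple of lines with `tf.one_hot`; the vocabulary list here is the one from the example:

```python
import tensorflow as tf

vocabulary = ["<PAD>", "i", "am", "the", "happy", "people", "met", "laughing"]
indices = [1, 6, 4, 5, 7]  # 'I met happy people laughing' as vocabulary positions

# each index becomes a row of zeros with a single 1 at that position
one_hot = tf.one_hot(indices, depth=len(vocabulary))
print(one_hot.shape)  # (5, 8): 5 words, vocab_size = 8
```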

Good News:

TensorFlow makes this process very easy with its TextVectorization Layer which we imported earlier.
With TensorFlow, you don’t even need to convert each word into a one-hot vector. All you need is the position of each word in the vocabulary:

While you see:  'I met happy people laughing'
The model sees: [1, 6, 4, 5, 7]

Another example with a batch of 3 sentences:

While you see:  'I met happy people laughing'
                'happy people'
                'I am happy'
The model sees: [[1, 6, 4, 5, 7],
                 [4, 5],
                 [1, 2, 4]]
Padding is then added to make all sentences the same length. Notice that the padding token (<PAD>) occupies the zeroth position in the vocabulary.
DIMENSION: BATCH_SIZE x SENTENCE_LENGTH = 3 x 10 (unlike the one-hot notation, which would have yielded a dimension of 3 x 10 x 8).
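The padding step can be sketched with Keras' `pad_sequences` helper (a stand-in here; later, the TextVectorization layer does this for us automatically):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

batch = [[1, 6, 4, 5, 7],
         [4, 5],
         [1, 2, 4]]

# pad each sentence with 0 (<PAD>) up to a length of 10
padded = pad_sequences(batch, maxlen=10, padding='post', value=0)
print(padded.shape)  # (3, 10): BATCH_SIZE x SENTENCE_LENGTH
```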

+ Locate the Dataset

+ Dataset Extraction

Extract the Dataset using TextLineDataset, which is part of the tf.data API.
TextLineDataset reads data from a text file line by line; each line is a data point in the Dataset.

Let’s visualize 1 data point in this Dataset. To view more data points, simply change the ‘1’ to ‘number_of_datapoints’.

>> Output:
tf.Tensor(b'Go.\tVa!\tCC-BY 2.0 (France) Attribution: #2877272 (CM) & #1158250 (Wittydev)', shape=(), dtype=string)
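A minimal sketch of this extraction step (the filename `fra.txt` is an assumption; here a tiny stand-in file is created so the example runs on its own):

```python
import tensorflow as tf

# stand-in for the downloaded English-French file (assumed name: fra.txt)
with open("fra.txt", "w", encoding="utf-8") as f:
    f.write("Go.\tVa!\tCC-BY 2.0 (France) Attribution: "
            "#2877272 (CM) & #1158250 (Wittydev)\n")

# each line of the file becomes one data point
dataset = tf.data.TextLineDataset("fra.txt")
for line in dataset.take(1):  # change 1 to number_of_datapoints to see more
    print(line)
```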

Let’s break this data point up into the inputs and outputs. It turns out that we’ll have 2 inputs and 1 output per data point.
Why??? Teacher Forcing.
In Machine Translation, we use teacher forcing to speed up the training process.

What is teacher forcing?
This is a training strategy generally used in sequence-to-sequence (Seq2Seq, or encoder-decoder) models, like translation, where we have an input sequence (English text) and an output sequence (French text).

Instead of:

Extract from Deep Learning Course by

We’ll use this:

Extract from Deep Learning Course by

Notice the inclusion of the start token (starttoken). With teacher forcing, the output is fed back as an input in the decoder section of the encoder-decoder (Seq2Seq) model.

Let’s use TensorFlow to split our Dataset, using the split method from tf.strings.

We have the following function (selector), which breaks the data points into English inputs, shifted French inputs (shifted because we prepend a start token), and French outputs.

A small note on how tf.strings.split works

tf.strings.split("I'm free.\tJe suis libre.\tCC-BY 2.0 (France) Attribution: #23959 (CK) & #6725 (sacredceltic)", '\t')
>> Will output:
["I'm free.", "Je suis libre.", "CC-BY 2.0 (France) Attribution: #23959 (CK) & #6725 (sacredceltic)"]

That’s why, when the selector method gets an input, it returns 3 elements:

1- The 1st input (the English sentence), selected with index [0:1]
2- The 2nd input (the shifted French sentence), selected with index [1:2]. The shifted French sentence is obtained by prepending the ‘starttoken’ to the French sentence
3- The output (the French sentence), selected with index [1:2]
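The selector function itself was lost in extraction; based on the description above, it might look like this sketch (the function and token names follow the article, but the exact body is an assumption):

```python
import tensorflow as tf

def selector(line):
    # each line is: English \t French \t attribution
    parts = tf.strings.split(line, '\t')
    english = parts[0:1]                                  # 1st input
    french = parts[1:2]                                   # output
    # prepend the start token to form the shifted decoder input
    french_shifted = tf.strings.join(['starttoken ', french])
    return english, french_shifted, french
```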

We now map the Dataset to its new values, using the map method of tf.data.Dataset.

Let’s now see what our Dataset looks like:

>> Output:
(<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken Va!'], dtype=object)>, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Va!'], dtype=object)>)

Everything seems to be set, but we need to vectorize the data.

  • First, create a vocabulary of words, using all the words in our Dataset.
  • Then vectorize the data based on this vocabulary (just like what we did previously).

Oops! We also need to clean the data.

+ Data cleaning

  • Send all words to their lowercase form (‘HAPPY’, ‘Happy’, ‘happy’, ‘HAPPy’, and ‘haPPY’ shouldn’t represent different words in the vocabulary; converting all words to lowercase avoids such confusion).
  • Replace HTML tags with ‘’ (data is often scraped from web pages and contains many HTML tags; we take them off since they provide no added value).
  • Remove all punctuation (although you may build a vocabulary with the punctuation).
  • Replace double spaces with a single space.
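The four cleaning steps above can be sketched with `tf.strings.regex_replace`; the function name `preprocess_sentences` matches the one mentioned later, but this exact body is an assumption:

```python
import re
import string

import tensorflow as tf

def preprocess_sentences(text):
    text = tf.strings.lower(text)                          # 1. lowercase
    text = tf.strings.regex_replace(text, '<[^>]*>', '')   # 2. strip HTML tags
    punctuation = '[%s]' % re.escape(string.punctuation)
    text = tf.strings.regex_replace(text, punctuation, '') # 3. remove punctuation
    text = tf.strings.regex_replace(text, '  ', ' ')       # 4. double -> single space
    return text
```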

Data is clean 🙂 , back to vocabulary creation and vectorization

Here again, TensorFlow intervenes with the TextVectorization layer (which can even be inserted directly into a TensorFlow Model)

What is the exact role of TensorFlow’s Vectorizer?

  • Gather all words in the Dataset and automatically create a vocabulary.
  • Map each word in this vocabulary to an index.

What can the TensorFlow Vectorizer do apart from this?

  • Preprocess the sentences.
  • Automatically pad when the number of words in a sentence is less than (<) the sequence_length (sentence_length).
  • Automatically trim when the number of words in a sentence is greater than (>) the sequence_length.
  • And other things you can check in the documentation.

Thank you TensorFlow❤️.

Below, we define the Vectorizers we shall be using (The Vectorizer takes as input the method preprocess_sentences and the sequence_length).

Note that you could decide that the input English sentences will have a different sequence length from the French sentences (both the shifted input and the output).
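The Vectorizer definitions might look like this sketch (SEQUENCE_LENGTH = 10 and the variable names are assumptions; a minimal stand-in cleaning function is included so the example is self-contained):

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

SEQUENCE_LENGTH = 10  # assumed value; pads/trims every sentence to 10 tokens

def preprocess_sentences(text):
    # minimal stand-in for the cleaning described in the Data cleaning section
    text = tf.strings.lower(text)
    return tf.strings.regex_replace(text, "[.!?,]", "")

english_vectorizer = TextVectorization(
    standardize=preprocess_sentences,
    output_sequence_length=SEQUENCE_LENGTH)
french_vectorizer = TextVectorization(
    standardize=preprocess_sentences,
    output_sequence_length=SEQUENCE_LENGTH)
```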

The Vectorizers have been defined; it’s time to:

  • Gather all words in the Dataset and automatically create a vocabulary.
  • Map each word in this vocabulary to an index.

Note: You could also define a separate function (to replace the lambda function) to select the part of the data to be vectorized.
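The adapt step might look like the following sketch, on a toy stand-in dataset (the variable names and the lambda selection are assumptions based on the description above):

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

english_vectorizer = TextVectorization(output_sequence_length=10)
french_vectorizer = TextVectorization(output_sequence_length=10)

# toy stand-in for the (english, shifted_french, french) dataset
dataset = tf.data.Dataset.from_tensor_slices(
    ([["go"]], [["starttoken va"]], [["va"]]))

# adapt builds each vocabulary from the words seen in its part of the data
english_vectorizer.adapt(dataset.map(lambda en, fr_in, fr_out: en))
french_vectorizer.adapt(dataset.map(lambda en, fr_in, fr_out: fr_in))
```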

Let’s take 1 example:

Let’s now see what this vocabulary which has been adapted to our Dataset looks like

>> Output:
['', '[UNK]', 'i', 'it', 'm', 'tom', 's', 'go', 'get', 'we', 'you', 'me', 'up',...
>> Output:
['', '[UNK]', 'starttoken', 'je', 'est', 'suis', 'tom', 'j', 'nous', 'ai', 'le',...
>> Output:
['', '[UNK]', 'je', 'est', 'suis', 'tom', 'j', 'nous', 'ai', 'le',...
>> Output:
639: words
>> Output:
1414: words
>> Output:
1413: words

Last, but not least:
Map the Dataset (each sentence to a vector based on the positions, in the vocabulary, of the words it contains).

Note: When working with 2 or more inputs or outputs, use a dictionary or list. In our case, we have 2 inputs, which we shall put in a dictionary.

We apply the squeeze method on the zeroth (0th) axis.
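The mapping step might look like this sketch (the dictionary keys 'in1' and 'in2' match the output shown below; the toy data and variable names are assumptions):

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

english_vectorizer = TextVectorization(output_sequence_length=10)
french_vectorizer = TextVectorization(output_sequence_length=10)
english_vectorizer.adapt(["go"])
french_vectorizer.adapt(["starttoken va"])

# toy stand-in for the (english, shifted_french, french) dataset
dataset = tf.data.Dataset.from_tensor_slices(
    ([["go"]], [["starttoken va"]], [["va"]]))

def vectorize(en, fr_in, fr_out):
    # squeeze drops the extra axis of size 1 left by the (1,) string tensors
    return ({'in1': tf.squeeze(english_vectorizer(en), axis=0),
             'in2': tf.squeeze(french_vectorizer(fr_in), axis=0)},
            tf.squeeze(french_vectorizer(fr_out), axis=0))

dataset = dataset.map(vectorize)
```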

Here is what our Dataset looks like, when we take out 2 data points

{'in1': <tf.Tensor: shape=(10,), dtype=int64, numpy=array([7, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)>, 'in2': <tf.Tensor: shape=(10,), dtype=int64, numpy=array([ 2, 29,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int64)>} tf.Tensor([28  0  0  0  0  0  0  0  0  0], shape=(10,), dtype=int64)
{'in1': <tf.Tensor: shape=(10,), dtype=int64, numpy=array([7, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)>, 'in2': <tf.Tensor: shape=(10,), dtype=int64, numpy=array([  2, 201,   0,   0,   0,   0,   0,   0,   0,   0], dtype=int64)>} tf.Tensor([200   0   0   0   0   0   0   0   0   0], shape=(10,), dtype=int64)

+ Shuffling, creation of val set, caching, batching, pre-fetching

Notice the validation Dataset is obtained by taking the first VALIDATION_BRIDGE examples, and the train Dataset is obtained by skipping the first VALIDATION_BRIDGE examples (i.e., taking the remaining ones).

0 ----------------------> VALIDATION_BRIDGE ----------------------> TOTAL_DATASET
<-----------VAL---------> <--------------------TRAIN-------------->

Finally, we create batches and do caching and pre-fetching to speed up training.
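A sketch of this whole pipeline (BATCH_SIZE, VALIDATION_BRIDGE, and the integer stand-in dataset are assumptions; the real code would operate on the vectorized translation Dataset):

```python
import tensorflow as tf

BATCH_SIZE = 64           # assumed hyper-parameter
VALIDATION_BRIDGE = 100   # assumed number of validation examples

# integer stand-in for the vectorized Dataset
dataset = tf.data.Dataset.range(1000)
# reshuffle_each_iteration=False keeps the val/train split stable
dataset = dataset.shuffle(1000, seed=1, reshuffle_each_iteration=False)

val_dataset = dataset.take(VALIDATION_BRIDGE)    # first examples -> validation
train_dataset = dataset.skip(VALIDATION_BRIDGE)  # remaining examples -> train

train_dataset = (train_dataset
                 .cache()                        # keep data in memory after 1st epoch
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.AUTOTUNE))    # overlap preprocessing and training
val_dataset = val_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```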

Model — Attention is all you (we 🙂) need.

Pre-requisites: understanding the attention mechanism.

Find paper here

A Transformer model is made of an Encoder and a Decoder.

To better understand this schematic, consider the following slide adapted from

