A literature review on Statistical Machine Translation (SMT) and Neural Machine Translation (NMT): the encoder-decoder structure with and without an attention mechanism, and the state-of-the-art Transformer architecture.
Machine Translation plays a vital role in today’s digitized and globalized world. It benefits society by processing and translating one natural language into another. With advances in technology, an enormous amount of information is exchanged between regions that use different languages, and the demand for Machine Translation has grown rapidly over the last few decades. As a result, Machine Translation has become an active research area. It can be divided into three distinct approaches: rule-based approaches, statistical approaches, and neural approaches. In this article, I will mainly focus on the statistical and neural approaches.
Machine Translation is the task of translating a sentence in one language (the source language) into a sentence in another language (the target language). It is the sub-field of computational linguistics that aims to use computers to translate text automatically from one language to another. Machine Translation research began in the early 1950s, during the Cold War, when there was a need to translate Russian documents into English. Because Russian language experts were scarce and manual translation was very time consuming, Machine Translation was seen as a solution. The systems developed at that time were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts. Even though they did not work very well, they gave way to statistical systems in the late 1980s. In the 1990s, statistical word-based and phrase-based approaches that required little to no linguistic information became popular. The core idea of Statistical Machine Translation (SMT) is to learn a probabilistic model from data. In the 2010s, with the advent of deep neural networks, Neural Machine Translation (NMT) became a major area of research. NMT performs Machine Translation with deep neural networks, using a neural architecture called sequence-to-sequence (seq2seq). The vanilla seq2seq NMT model involves two Recurrent Neural Networks (RNNs). NMT research has pioneered many of the recent innovations in natural language processing (NLP). Researchers have found many improvements to vanilla seq2seq, and one major improvement is the attention mechanism. Motivated by the attention mechanism, the paper “Attention Is All You Need” introduced a novel architecture called the Transformer, which is now the state-of-the-art architecture for language modeling. This architecture relies entirely on attention, without any RNN, and models built on it, such as BERT, have been applied to many NLP tasks with state-of-the-art performance.
Statistical Machine Translation
Statistical Machine Translation (SMT) learns a probabilistic model from data. Suppose we are translating from German to English: we want to find the best English sentence y given the German sentence x. SMT formulates the task as finding the y that maximizes the conditional probability P(y|x).
In other words, among all possible y, we want to find the most probable one. Using Bayes’ rule, and noting that P(x) does not depend on y, this is equivalent to finding the y that maximizes the product P(x|y)P(y).
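Written out, the objective and its Bayes-rule decomposition take the following standard form. The symbol \hat{y} for the selected translation is notation introduced here for illustration.

```latex
\begin{aligned}
\hat{y} &= \arg\max_{y} P(y \mid x) \\
        &= \arg\max_{y} \frac{P(x \mid y)\, P(y)}{P(x)} \\
        &= \arg\max_{y} P(x \mid y)\, P(y)
\end{aligned}
```

The last step drops P(x) because it is the same for every candidate y and so does not change which y attains the maximum.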
P(x|y) is called the translation model. It models how words and phrases should be translated and is learned from parallel data, for example pairs of human-translated German-English sentences. P(x|y) is further broken down into P(x, a|y), where a is the word alignment, i.e., the word-level and phrase-level correspondence between source sentence x and target sentence y.
P(y) is called the language model. It models the probability of generating strings in a language and is learned from monolingual data. A language model is a function that assigns a probability to strings drawn from a vocabulary. Given a string y of length n, the chain rule gives the language model probability P(y) as the product, over every position i, of the probability of word yᵢ given all the words before it.
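In symbols, this chain-rule factorization of the language model probability can be written as:

```latex
P(y) = P(y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} P(y_i \mid y_1, \dots, y_{i-1})
```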
However, computing the probability of a word given its entire history is inefficient, so we approximate it with an n-gram model. The n-gram model makes a Markov assumption that yᵢ depends only on the preceding n-1 words.
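As a rough illustration of the Markov assumption, here is a minimal bigram (n = 2) language model estimated from raw counts. The toy corpus, the `<s>` and `</s>` padding tokens, and the function names are assumptions made for this sketch, and no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams over tokenized sentences padded with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sentence_probability(tokens, context_counts, bigram_counts):
    """P(y) under the bigram assumption: the product of P(y_i | y_{i-1})."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

# Toy monolingual corpus (hypothetical data for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
context_counts, bigram_counts = train_bigram_lm(corpus)

p_seen = sentence_probability(["the", "cat", "sat"], context_counts, bigram_counts)
p_unseen = sentence_probability(["the", "dog", "ran"], context_counts, bigram_counts)
print(p_seen, p_unseen)
```

The second sentence scores zero because the bigram ("dog", "ran") never occurs in the toy corpus, which is the sparsity problem that practical n-gram models handle with smoothing.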
To compute the argmax, we could enumerate every possible translation y and calculate its probability, but that is computationally prohibitive. Instead, SMT uses decoding, a heuristic search algorithm that looks for the best translation while pruning hypotheses with low probability. This is a brief overview of how SMT works. The best SMT systems were extremely complex, and many important details are not covered here. SMT is expensive and time-consuming to develop because it needs a lot of feature engineering and human effort to maintain.
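The text does not name a specific search algorithm. The sketch below assumes beam search, one common pruning heuristic, and runs it over a hypothetical toy model (`TOY_MODEL`) that maps the last token to next-token probabilities. A real SMT decoder scores hypotheses with the translation and language models instead.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"cat": 0.2, "dog": 0.8},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def toy_next_token_probs(tokens):
    """Look up the distribution over the next token given the tokens so far."""
    last = tokens[-1] if tokens else "<s>"
    return TOY_MODEL[last]

def beam_search(next_token_probs, beam_size=2, max_len=5, eos="</s>"):
    """Keep only the `beam_size` most probable partial hypotheses at each step."""
    beams = [([], 0.0)]  # (tokens so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Pruning step: hypotheses outside the top `beam_size` are discarded.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_tokens, best_score = beam_search(toy_next_token_probs)
print(best_tokens, math.exp(best_score))
```

With a beam of size 2, only two of the four two-word hypotheses survive each step, which is how the search avoids enumerating every possible translation.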
Neural Machine Translation
Deep neural networks have achieved state-of-the-art performance in various applications. Growing out of research on using neural networks within SMT, Neural Machine Translation (NMT) became the major area of research. It uses a single end-to-end neural network architecture called sequence-to-sequence (seq2seq), which involves two RNNs: an encoder RNN and a decoder RNN. The encoder RNN summarizes the source sequence in an encoding vector, and the decoder RNN generates the target sentence conditioned on that encoding vector. The seq2seq model is a conditional language model that directly calculates P(y|x), because the decoder RNN predicts the next word of the target sentence y conditioned on the source sentence x. The seq2seq model can also be used for many other natural language processing tasks, such as summarization and dialogue chatbots.
In vanilla seq2seq, as illustrated in Figure 2, the encoder RNN (blue blocks) reads the input sentence in the source language and encodes it into a history vector called the hidden state vector. The last hidden state, or encoding vector, is passed to the decoder RNN (red blocks) as its initial hidden state. The decoder’s initial hidden state, together with the <eos> token of the source sentence as the first decoder input, produces a hidden state that is then passed to a linear layer.
The linear layer followed by Softmax outputs a probability distribution over the whole vocabulary of the target language. From that distribution, the model chooses the token with the highest probability as the first word, say X, which is then used as the second input to the decoder. The hidden state from the previous step and the first generated word X are fed into the second step of the decoder RNN. The same process is repeated until the decoder produces an <eos> token. The sequence of tokens generated by the decoder RNN is the output of the seq2seq model.
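The decoding loop described above can be sketched as follows. This is only an illustration: `toy_decoder_step` is a hypothetical stand-in for one decoder RNN step plus the linear layer, with hard-coded logits so the example is deterministic, and the four-word vocabulary is made up.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat"]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_decoder_step(prev_token, hidden):
    """Hypothetical decoder step: returns (logits over VOCAB, new hidden state).

    The hidden state here is just a step counter, and prev_token is ignored;
    a real RNN would compute both outputs from learned weights.
    """
    script = ["the", "cat", "sat", "<eos>"]
    target = script[min(hidden, len(script) - 1)]
    logits = [3.0 if word == target else 0.0 for word in VOCAB]
    return logits, hidden + 1

def greedy_decode(encoder_state, max_len=10):
    """Feed each predicted token back into the decoder until <eos> appears."""
    hidden = encoder_state  # the encoder's last hidden state initializes the decoder
    token = "<eos>"         # the source-side <eos> is the first decoder input
    output = []
    for _ in range(max_len):
        logits, hidden = toy_decoder_step(token, hidden)
        probs = softmax(logits)
        token = VOCAB[probs.index(max(probs))]  # highest-probability token
        if token == "<eos>":
            break
        output.append(token)
    return output

translation = greedy_decode(encoder_state=0)
print(translation)
```

Picking the single most probable token at every step is greedy decoding. It is simple, but an early poor choice cannot be undone later.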
The advantage of NMT over SMT is that it performs better and requires much less human effort. However, it is less interpretable, hard to debug, and difficult to control.
Seq2Seq with Attention
As mentioned before, in the vanilla seq2seq model the last hidden state of the encoder RNN is used as the initial state of the decoder RNN. This means all the information about the source sentence is encoded in a single vector, and that vector is the only source-side information available when decoding the target sentence. The last hidden state of the encoder RNN can therefore become an information bottleneck. The attention mechanism addresses this problem by selectively focusing on parts of the source sentence during translation. The core idea is that, on each step, the decoder uses a direct connection to the encoder to compute a weighted combination of the source sequence. Figure 3 illustrates how the attention layer is added to a vanilla seq2seq model. At each time step t in the decoding process, the attention layer derives the context vector cₜ, which captures relevant source sequence information to help predict the current target word yₜ.
The attention distribution, aₜ, is calculated from the current target hidden state hₜ and all encoder hidden states ℎ̅ₛ. The attention scores are computed first, and Softmax then turns the scores into a probability distribution. There are three different ways to calculate attention scores.
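Assuming these are the dot, general, and concat scoring functions of Luong et al. (2015), whose notation this section appears to follow, the scores and the resulting attention distribution can be written as below. W_a and v_a denote learnable parameters in that formulation.

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top} \tanh\left(W_a [h_t ; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
\qquad
a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)}
```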
Using the alignment vector aₜ as weights, the context vector cₜ is computed as the weighted average of all the encoder hidden states. The context vector, or attention output, will mostly contain information from the encoder hidden states that received high attention scores. The context vector is then concatenated with the decoder’s hidden state, and the result is used to select the target-language token with the highest probability, as in the vanilla seq2seq model. Sometimes the context vector from the previous step is also fed into the decoder along with the usual decoder input.
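To make the score, Softmax, and weighted-average steps concrete, here is a small sketch that assumes the simplest scoring choice, a plain dot product between the decoder state and each encoder state. The vectors are made-up toy values, not outputs of a trained model.

```python
import math

def softmax(scores):
    """Turn attention scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return (attention weights a_t, context vector c_t) for one decoder step."""
    # Score each encoder hidden state against the current decoder hidden state.
    scores = [sum(h * e for h, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy hidden states (hypothetical values): three source positions, dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = dot_attention(decoder_state, encoder_states)
print(weights, context)
```

The first and third source positions align with the decoder state, so they receive most of the weight and dominate the context vector.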
The attention mechanism significantly improves NMT performance and has become a key building block in NLP research. With attention, each word in the target sequence only needs to find its match among a few words in the source sequence. This eases the long-term dependency limitation of RNNs, because the decoder has a direct connection to every source position instead of relying on information carried through many recurrent steps. Moreover, the attention mechanism provides some interpretability, because the network learns the soft alignment by itself and we can see what the decoder was focusing on by inspecting the attention distribution.
Many difficulties remain in Machine Translation, such as out-of-vocabulary words, domain mismatch between training and test data, maintaining context over long sequences, and low-resource language pairs for which little labeled data is available.
Building on the best-performing seq2seq model with attention, the paper “Attention Is All You Need” proposed a new architecture called the Transformer in 2017, as illustrated in Figure 4. The main idea is to use attention alone for representation learning, since the earlier seq2seq attention models had shown that attention between the encoder and decoder is crucial in NMT.
I will start with self-attention, the main concept used for representation learning in the Transformer. Consider Q, K, and V, which are computed from the word embedding vectors. The query Q is the vector representation of one word in the sequence, the keys K are the vector representations of all words in the sequence, and the values V are likewise vector representations of all words in the sequence.
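The Transformer paper combines these as scaled dot-product attention, softmax(QKᵀ / √d_k)V, where d_k is the key dimension. Below is a minimal pure-Python sketch of that formula. For simplicity it uses the same toy matrix for Q, K, and V, which is an assumption of this example: the actual model first applies learned linear projections.

```python
import math

def softmax(row):
    """Normalize one row of scores into attention weights."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output: attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Toy embeddings (hypothetical values): three tokens, dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(X, X, X)
print(out)
```

Every output row is a weighted average of all value vectors, so each token’s new representation mixes in information from every other token in the sequence.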
Instead of using a single attention function, the paper proposes “multi-head attention”, which linearly projects Q, K, and V h times with different learnable weights and applies attention to each projection in parallel. This provides multiple representation subspaces and allows the model to focus on different positions.
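A rough sketch of multi-head self-attention is shown below. The projection matrices are filled with random numbers as a stand-in for learned weights, and the helper names (`matmul`, `random_matrix`, `multi_head_attention`) are hypothetical, so only the shapes and data flow are meaningful here.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    """Project X into num_heads subspaces, attend in each, concatenate, project back."""
    d_model = len(X[0])
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads' outputs for each token, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
    W_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, W_o)

rng = random.Random(0)
X = random_matrix(3, 4, rng)  # 3 tokens, d_model = 4
out = multi_head_attention(X, num_heads=2, rng=rng)
print(len(out), len(out[0]))
```

Each head works in a smaller subspace of size d_model / h, so the total cost stays close to that of a single full-width attention function.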
Unlike an RNN, self-attention in the Transformer is order invariant, which means it does not capture sequence information. Therefore, a “positional encoding” is added to inject the position of each token, using sine and cosine functions of different frequencies. This provides a deterministic way to incorporate sequence information without increasing the number of learnable parameters.
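The sinusoidal encoding from the Transformer paper uses sin(pos / 10000^(2i/d_model)) for even dimensions and the matching cosine for odd dimensions. A minimal sketch follows, with a small `seq_len` and `d_model` chosen only for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # `i` walks over the even dimensions, so i / d_model equals 2k / d_model.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=4)
for row in pe:
    print([round(v, 3) for v in row])
```

The resulting table is added element-wise to the token embeddings, and because it is computed rather than learned it adds no parameters.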
In the Transformer encoder, the input embedding, after the positional encoding is added, first flows through a multi-head attention layer, whose output is then fed into a feed-forward neural network. In the Transformer decoder, there are two attention layers, called masked multi-head attention and encoder-decoder attention. Masked multi-head attention is the decoder’s self-attention over previously generated outputs, with future words masked out. Encoder-decoder attention takes its queries from the masked multi-head attention layer and its keys and values from the output of the encoder. As usual, the decoder output is then passed to a linear layer and a Softmax function to get the probability distribution over target tokens. Each attention and feed-forward sub-layer has a residual connection (skip connection) around it, followed by layer normalization, which rescales its input to have mean 0 and variance 1.
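Two of the pieces above are easy to illustrate in isolation: masking future positions before the Softmax, and the residual connection followed by layer normalization. The sketch below makes simplifying assumptions: the attention scores are all zero, and the learnable gain and bias of layer normalization are omitted.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask_scores(scores):
    """Set scores for future positions (column j > row i) to -inf before Softmax."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to mean 0 and (approximately) variance 1."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])

# Masked attention weights for a 3-token sequence with uniform raw scores.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = [softmax(row) for row in causal_mask_scores(scores)]

# Residual + layer norm on a toy feature vector and a toy sub-layer output.
normalized = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(weights, normalized)
```

After masking, the first position can attend only to itself, the second to the first two positions, and so on, which prevents the decoder from looking at words it has not generated yet.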
The Transformer replaces the sequential computation used in previous seq2seq RNN models with self-attention, a pairwise multiplicative interaction that can be computed for all positions in parallel in a single step. This architecture is the current state of the art and provides high performance along with some interpretability.
The most frequently used measure for evaluating Machine Translation is Bilingual Evaluation Understudy (BLEU). It compares the machine-written translation to one or several human-written translations and computes a similarity score based on n-gram precision, usually for 1-, 2-, 3-, and 4-grams. BLEU is very useful, and most machine translation systems are evaluated with it. However, it is not perfect, because there are many valid ways to translate a sentence, and a good translation can get a poor BLEU score if it has low n-gram overlap with the human translation.
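The following is a simplified, sentence-level sketch of BLEU with a single reference: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names and example sentences are made up for illustration. Real evaluations usually compute BLEU over a whole corpus, often with smoothing and multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram match zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat sat on the mat".split()
paraphrase = "a feline was sitting on the rug".split()

score_exact = simple_bleu(reference, reference)
score_paraphrase = simple_bleu(paraphrase, reference)
print(score_exact, score_paraphrase)
```

The paraphrase is a reasonable translation of the same meaning, yet it shares no 3-gram with the reference and so scores zero here, which illustrates the weakness described above.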