Natural Language Processing: From one-hot vectors to billion parameter models*fQsTJufvnacFpNdQ

Original Source Here

The Word2Vec algorithm makes two choices, as briefly mentioned above:

First, it captures local relationships. Secondly, it treats words as basic units. Other algorithms take other choices. For example, the GloVe algorithm, short for Global Vectors, creates embeddings based on global relationships and uses words as basic units. A third algorithm is FastText (see paper references here), which uses sub-word information as basic units (e.g. “this” -> “th” and “is”).

A few challenges exist, regardless of the algorithm choice: How to handle punctuation? How to handle plural words? Different word forms? Such modifications are part of the pre-processing, which often includes stemming and lemmatization. Stemming reduces a word to its stem. This technique is efficient but can create artificial word stems. Lemmatization tries to reduce a word to its base form, which is often achieved with large databases.

With the advance of modern end-to-end approaches, the importance of pre-processing has slightly become less. Only tokenization remains relevant, such as Byte-Pair encodings. Before we get there, we have to take a look at n-grams.

(Modern) n-gramns

To begin incorporating word order, we can use a so-called n-gram approach. Previously, when creating the BoW representation, we split the sentence into single words. Instead, we could also divide it into segments of two successive words, three successive words, and so on. The number of words in a row is the n in n-gram. If n is one, the parsing model is called unigram; if n is two, it is called bigram.

As an example, let’s parse “The quick brown fox jumps over the fence” with a bigram model:

[“The quick”, “quick brown”, “brown fox”, “fox jumps”, …]

As we see in this example, we slide a window over the text, with a size n and a stride of one. The stride of one causes the last word of the previous extract to be the first word of the following excerpt. Using this approach, we partially keep the relationship between words.

The representation is static. If the n parameter changes, the dataset has to be pre-processed again. Depending on the size, this can take a while. However, we can incorporate the n-gram algorithm into neural networks and let them do the “pre-processing.” This approach can be seen as modern n-grams.

From the embedding step, we use the obtained dense and informative representation vectors for every single word. To incorporate word order, we have to alter the network input slightly. Given a sentence of length l and embeddings of size e, the input is now a matrix with shape:

l x e

This matrix can be interpreted as a strange image. In image-related tasks, convolutional neural networks are highly effective, but they are not restricted to this domain. A convolution operation going over the input text-matrix can be interpreted as creating n-grams. When the kernel size is set to 2 and the stride to 1, this essentially creates a bi-gram representation. Keeping the stride but increasing the kernel’s size to 3, we obtain a tri-gram model.

The convolution operation has some beneficial properties. Among them is the location invariance. A kernel can be applied at every position of the input. If one such kernel focuses on adjective-noun combinations, it will detect them at any place. Additionally, convolution operations naturally share their weights and have significantly fewer parameters than dense layers.

The drawback of n-grams is that they only capture relationships over a limited range. For example, N-grams can not express the connection between the first and last word of a sentence. Setting the parameter n to the length of a sentence would work but would require a separate n for each differently-sized sentence. There is a better approach to handle relationships over a large span, the attention mechanism. This technique was introduced for machine translation tasks.

Machine Translation

The Machine Translation (MT) task deals with translating text from a source to a target language. For example, “Est-ce que les poissons boivent de l’eau?” (French) would be translated to “Do fishes trink water?”

Early MT research was done during the Cold War and aimed to translate Russian to English. The systems in use were mainly based on rules and required dictionaries. In the next step, the word order was corrected.

From the 1990s on, statistical MT became the focus. The core idea is to learn a probabilistic model. With an emphasis on translational tasks, this means the following: Given a sentence in a source language, we need to find the most proper sentence in the target language — i.e., the best translation.

Mathematically, this is expressed by y* = argmax_y P(y|x). The y* denotes the translation with the highest probability.

With the rules of Bayes, this description can be broken down into two separate parts: y* = argmax_y P(x|y) P(y). The first component, P(x|y), models the probability of x being the translation of y. This part is called the translational model. The second part, the language model, gives the probability of y in the target language.

In other words, the first part quantifies the probability of the source sequence, and the second part quantifies the probability of the target sequence. The translational model is learned with large text datasets in the source and target language(s).

The advance of deep neural networks also influenced the task of machine translation. Recurrent Neural Networks, RNNs, are a good choice since both the input and output are sequences. The input is the sentence to be translated; the output is the translation. Two problems emerge with a single RNN: First, the source and target sentence lengths can be different. This fact is evident in the initial example translation. The input contains eight tokens, and the output only five (depending on how you count punctuation, but the problem stays the same). Secondly, the word order can change, rendering a single one-in, one-out RNN unusable.

The solution to both problems is using two recurrent networks in a so-called Sequence-to-Sequence (Seq2Seq) layout [2,3]. The first network, named encoder, takes the input and yields an internal representation. The second network, called decoder, uses this internal representation to generate the target sequence, e.g., the translation.

A short note: The internal representation can model any knowledge about the input sentences, which includes word relationships. The problem of long-time dependencies is thus minimized.

A single RNN reads the input token-wise parsing and immediately returns a translation. In the Seq2Seq layout, the encoder reads the complete input sequence first to create an internal representation. Only then is this representation, which usually is a vector, passed to the decoder network.

This setup is trained text pairs, and the probability of the generated target sequence is seen as the loss. During training, the model, which is the combination of the encoder and decoder, maximizes the probability of appropriate translations.

This setup is all that is required to handle Seq2Seq tasks. However, as usual, one can make a couple of improvements, among them the attention mechanism.


In the vanilla sequence-to-sequence layout, the encoder compresses all information about the input sentence into a single vector. This vector is fittingly called the bottleneck layer, as it causes an information bottleneck. Despite the ability to parse endless sequences, the information from earlier steps will slowly fade as new information gets added.

The solution to this bottleneck is making all hidden encoder states accessible. In the vanilla setting, the encoder only delivers its last state, which becomes the bottleneck. This approach can be modified to save all intermediate states after the network parsed a new token. As a result, the encoder can store the per-step information and no longer compresses all knowledge into a single vector.

In the encoding step, the encoder network has access to all the per-step vectors. As a result, it can now pay attention to the most important hidden states. I’ll briefly go over the mechanism that achieves this [4].

The encoder saves all hidden states (the vectors after it parsed a new token). We then take the encoder’s current state and compute the dot-product between it and the hidden states. This operation returns a scalar number for each hidden state, which we can represent as a new vector:

[a, b, c, d, e],

where the characters represent the result of the dot product with the corresponding hidden state (e.g., a is the result of the dot product with the first hidden state). The softmax operation is then applied to this vector, yielding the attention weights. These weights express the importance of each hidden state, represented by a probability. In the last step, we multiply the hidden states with their attention score and sum the vectors up. This gives us the context vector for the current state, which is passed to the encoder (not completely right, but fine for this purpose). This process is repeated after each encoder step.

The process can be seen in this animation, taken from Google’s seq2seq repository:

Animation of the Sequence-to-Sequence process. Bold lines indicate the attention on specific encoder states. Animation from Google’s seq2seq repository.

An animation of the attention mechanism. From Google’s seq2seq.

The attention process enables the encoder to store the per-step information and lets the decoder decide which hidden states to pay attention to. These modifications have greatly improved the quality of text translation, as shown in [4].

A general drawback of recurrent networks in general, and the sequence-to-sequence approach in particular, is the low parallelizability. Recall that the output of step k+1 depends on step k; we thus have to parse all previous steps first. Therefore, we cannot run much in parallel here — except if we change the underlying network architectures. This idea leads us to Transformers.


The Transformer [5] is a neural network architecture that utilizes the attention mechanism. Under the hood, it still uses an encoder-decoder structure but replaces the recurrent networks. Instead, the encoder is modelled by n identical layers with self-attention and regular feed-forward networks. An encoder block uses the same structure and adds another attention layer that takes the encoder’s output. The following figure, taken from [5], shows the described setup:

Schematic view of the Transformer architecture. On the left the encoder stack, on the right the decoder stack. Image from [5].

A second improvement is the modification of the attention procedure. In the vanilla sequence-to-sequence approach, the hidden state of the decoder calculates the dot-product with all encoder hidden states. In the improved mechanism, called Multi-Head Self-Attention, these operations are modelled as matrix multiplications.

The self-attention transforms the input to an internal representation, a weighted sum of its own timesteps. This approach can capture long-term dependencies within the sequence. The input is converted into three different representations, the Key, the Query, and the Value, to model these relationships. These representations are obtained by multiplying the input with three weights: Wₖ (for the Key), Wᵥ (Value), and Wq (Query). The computation flow is shown in the following figure [5]:

Detailed view of the attention mechanism. Q, K, and V are different representations of the same input. Image from [5].

Q and K are matrix-multiplied, scaled, optionally masked, and then softmax-ed. Finally, the result is matrix-multiplied with V. This can mathematically be expressed in the following equation:

Representation of the attention mechanism as an equation. Image from [5]

Finally, there are multiple such “flows,” dubbed attention heads. Each head uses a different set of attention weights Wₖ, Wᵥ, and Wq. These weights yield multiple internal representations for the same input. The result of the individual attention heads are then concatenated, as shown in the following figure [5]:

Schematic visualization of the Multi-Head Attention mechanism. The individual vectors are concatenated before being fed into a linear layer. Image from [5].

These computations are done in the encoder and decoder blocks and can be run in parallel. This parallelization is one of the reasons why the Transformer approach is faster than vanilla RNN models. Further, the Transformer uses Byte Pair encoding, enabling him to generalize to unseen words. Rare words are a common problem, as are languages with compounding. Think of Germany’s crazy nouns such as “Baumhausprüfgesellschaft” (“Society that checks treehouses”). I highly doubt that such combinations are present in typical training data. So the idea is to use sub-word tokens, the aforementioned Byte Pair encoding: We extract the most frequent sub-words from the training data, such as “th” and “ng.” With such a vocabulary, we can then model an (unknown) word as a combination of the Byte Pairs.

To model word relationships, we have to preserve the information about the input order. The simple idea is to modify the tokens’ embedding vectors. We add a signal to each embedding that models the distance between vectors, the step-wise information.

More details could be covered here, but this will lead away from this post’s theme. If you are interested, I recommend Jay Alammar’s description, “The Illustrated Transformer.”

After the progress covered so far, one might get the impression that natural language can now be parsed fairly easily. This is not the case. A couple of problems persist. Coming back to word embeddings, we can now use the attention mechanism to model word relationships. However, consider the sentence “He opened a bank.” From the context, it becomes clear that the “bank” is the financial institution. Now consider “She sat on the bank.” Again, the context information explains that the “bank” is the thing used to sit on.

Word2Vec or similar frameworks only create one vector per token. However, as we have noticed, the meaning of a word is determined by its context. And the frequent context might vary between the training data (used to obtain the embeddings) and the test data. So how can we solve this? By using transfer learning.

Transfer Learning, Language Models and ELMo

In computer vision tasks, transfer learning is a standard procedure. First, learn a model on task A, then use the weights for B. Usually, tasks A and B have been trained on the same data types. Further, the problems being solved (object detection, classification, translation) are sufficiently similar. This concept can be expanded to the NLP domain.

We pretrain a model on an extensive dataset, modify the model, and reuse it for our task. In the pretraining step, the model has already learned linguistic features, such as syntax or semantics. The advantage is the decreased need for data for the new task.

A standard pretraining task is language modelling. When we train a language model, we predict the upcoming word based on the previous ones. For this task, ELMo [6] becomes relevant. ELMo, short for Embeddings from Language Models, consists of LSTM layers. Several of these layers are stacked and trained to predict the upcoming word. Additionally, the model goes one step further: it is a bi-directional language model. The sequences are parsed in both directions. To obtain an embedding vector, we concatenate the hidden layers and sum them up (simplified), as shown in this figure, taken from Jay Alammar’s introduction to BERT and co.:


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: