Journey to BERT : Part 2


The paper suggested combining the CoVe vectors (the last layer of the bi-LSTM encoder) with GloVe vectors and showed performance gains on some common NLP tasks. Unlike Tag-LM, however, CoVe needed labelled data (pairs of texts in two languages) to train the encoder on a machine translation task, which is an obvious limitation of the approach. In addition, the actual gains on a downstream task depend heavily on the architecture used for that task.


Embeddings from Language Models (ELMo) is in some sense a refinement of Tag-LM from the same group (Peters et al.). The authors suggest that a bi-directional language model learnt in an unsupervised way over a large corpus carries both the semantic and the syntactic connotations of words. The initial layers of the model capture syntactic information (useful for NER and POS tagging), while the later layers capture semantic information (useful for sentiment analysis, question answering, semantic similarity, etc.). Hence, instead of using only the last layer (as done in Tag-LM or CoVe, where the network is pre-trained and frozen), taking a linear combination of all the layers gives a better and richer estimate of the contextual meaning of a word. ELMo representations are hence considered ‘deep’.
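The "linear combination of all the layers" can be sketched in a few lines of NumPy. This is only an illustration: the layer states are random stand-ins for real bi-LSTM outputs, and the names `elmo_combine`, `layer_logits` and `gamma` are made up for this example. The softmax-normalised layer weights and the single scaling factor follow the description in the ELMo paper; in practice both are learnt by the downstream task.

```python
import numpy as np

def elmo_combine(layer_states, layer_logits, gamma=1.0):
    """Collapse per-layer token vectors into one vector per token.

    layer_states: array of shape (num_layers, seq_len, dim)
    layer_logits: array of shape (num_layers,), unnormalised layer scores
    gamma: scalar that rescales the whole combined vector
    """
    layer_states = np.asarray(layer_states, dtype=float)
    layer_logits = np.asarray(layer_logits, dtype=float)
    # Softmax turns the scores into weights that are positive and sum to 1.
    weights = np.exp(layer_logits - layer_logits.max())
    weights /= weights.sum()
    # Weighted sum over the layer axis gives shape (seq_len, dim).
    return gamma * np.tensordot(weights, layer_states, axes=1)

# Toy example: 3 layers, 4 tokens, 5-dimensional states (random stand-ins).
rng = np.random.default_rng(0)
states = rng.normal(size=(3, 4, 5))
uniform = elmo_combine(states, np.zeros(3))             # equal scores: plain average of layers
top_only = elmo_combine(states, [-50.0, -50.0, 50.0])   # nearly all weight on the last layer
```

With equal scores the result is just the average of the layers, and pushing all the weight onto the top layer recovers the "last layer only" setup of Tag-LM and CoVe as a special case.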

In ELMo, the language model is learnt in the usual way, by predicting the next word from the preceding words in a sentence, and this is done in both directions (a forward model reads left to right, a backward model right to left). The loss function is the negative log likelihood.

Source : Papers with Code
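To make the loss concrete, here is a minimal sketch of the negative log likelihood for one sentence scored in both directions. The function name `bilm_nll` and the probabilities are hypothetical; a real model would produce these probabilities from its softmax layer.

```python
import math

def bilm_nll(forward_probs, backward_probs):
    """Negative log likelihood of one sentence under a forward and a backward LM.

    forward_probs[k]:  probability the forward model gives token k given the tokens before it
    backward_probs[k]: probability the backward model gives token k given the tokens after it
    """
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must score the same tokens")
    return -sum(math.log(pf) + math.log(pb)
                for pf, pb in zip(forward_probs, backward_probs))

# Made-up probabilities for a two-token sentence.
loss = bilm_nll([0.5, 0.25], [0.5, 0.25])
```

The loss is zero only if both models give probability 1 to every correct token, and it grows as either direction becomes less certain.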

ELMo is an improvement over Tag-LM and CoVe, and this can be attributed to the fact that its representations are ‘deep’. The paper showed that ELMo achieved incremental performance gains on a variety of common NLP tasks.

Source : Original Paper

ELMo, however, still carried the shortcoming that its performance gains depend heavily on the architecture of the downstream task.

A contemporary of ELMo was another proposal, from Jeremy Howard and Sebastian Ruder: Universal Language Model Fine-tuning for Text Classification (ULMFiT). Apart from the pre-training step of language modelling using an RNN, ULMFiT proposed fine-tuning the LM on the target dataset. The rationale was to learn the ‘distribution’ or ‘task-specific features’ of the target dataset. The final step is the task-specific fine-tuning, for example of a classifier (using a few linear blocks).

Towards Transformers

The transformer architecture has become the foundation of more modern approaches such as the BERT family and the GPT series. First proposed in the paper ‘Attention Is All You Need’ by Vaswani et al. (2017), it presents an alternative to RNNs (and their variants) for processing sequential data.

Source: Original Paper

It’s better to refer to this excellent article by Jay Alammar for a comprehensive understanding. In concise form, though, the architecture has the following important elements.

  1. Multi-head self-attention: At a very high level, self-attention lets each word refer to the other words in the sequence when building its representation. It is basically another way of capturing (long-term) dependencies. Being ‘multi-head’ means employing several attention heads, each working in its own representational sub-space, so that different heads can focus on different relationships within the sequence. Something like using multiple brains. Mathematically, the self-attention output is a softmax over the scaled dot products of queries and keys, multiplied with the value matrix: softmax(QKᵀ/√d_k)·V.
Source : Original Paper

  2. (Sinusoidal) positional encodings: Interestingly, the transformer is not sequential in nature. In fact, it looks at and processes the entire sequence at once. The positional encoding is therefore what carries the order of the tokens. It is a vector, created using sin and cos functions of the token position, that is added to the token embedding. An excellent reference here.
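Both elements fit into a short NumPy sketch. It shows a single attention head only (multi-head attention runs several of these in parallel on smaller projections and concatenates the results), it assumes an even `d_model`, and the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for weights that would normally be learnt. The formulas follow the original paper; the function names are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len): every query against every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

def sinusoidal_positions(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy run: 4 tokens, model width 8; embeddings and weights are random stand-ins.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
pe = sinusoidal_positions(seq_len, d_model)
x = rng.normal(size=(seq_len, d_model)) + pe        # order information is added to the embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
```

Every row of `attn` is a probability distribution over all positions in the sequence, which is exactly the "direct access to every other token" that the attention layer provides.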

So what benefits does the transformer architecture bring over RNNs (bi-LSTMs)?

  1. Vanishing gradient: Transformers have no memory gates. This information-loss-prone mechanism is avoided because every position has direct access to all parts of the sequence.
  2. Long-term dependencies: Transformers are better at capturing long-term dependencies because of the multi-head self-attention layers.
  3. Bi-directional by design: The transformer encoder reads the entire sequence at once and uses all the surroundings of a word, both before and after it. Therefore, it is inherently bi-directional. In fact, many argue that it is non-directional.

Generative Pre-Trained Transformers (GPT)

First introduced in 2018 by Radford et al. (just before BERT), GPT was one of the first models to use the transformer architecture. The authors, from OpenAI, presented it as an effective combination of two existing ideas: (a) unsupervised pre-training (as seen in ELMo) and (b) transformers.

Further, the framework had two major components:

1. Unsupervised pre-training (using transformers), which basically means finding the network parameters that maximize the likelihood of each token given a context window of the tokens preceding it.

Source: Original Paper

The paper proposes a multi-layer (12-layer) transformer decoder for this, which basically consists of multi-headed self-attention layers followed by position-wise feed-forward layers, and produces a distribution over target tokens using a softmax. This variation of the original transformer architecture is uni-directional (left to right), as self-attention attends only to the left context.

2. Supervised fine-tuning: For a downstream task such as classification, the labelled data is fed through the pre-trained model to obtain representations, and the transformer decoder is fine-tuned. An additional linear layer followed by a softmax layer performs the final classification. The authors also propose keeping language modelling as an auxiliary learning objective during fine-tuning, which they show improves generalization.
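The fine-tuning objective can be sketched as a classification loss plus a weighted language-modelling loss. This is a simplified illustration, not the paper's implementation: `fine_tune_loss` and its argument names are made up, the hidden state stands in for the decoder's final-layer output at the last token, and `lam=0.5` is just an illustrative default for the weight of the auxiliary term.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                       # stabilise before exponentiating
    return z - np.log(np.exp(z).sum())

def fine_tune_loss(h_last, W_cls, label, lm_nll, lam=0.5):
    """Classification loss plus a weighted language-modelling loss.

    h_last: final-layer hidden state of the last token, shape (dim,)
    W_cls:  weights of the added linear layer, shape (dim, num_classes)
    label:  index of the correct class
    lm_nll: language-modelling loss on the same text (computed elsewhere)
    lam:    weight of the auxiliary LM term
    """
    task_nll = -log_softmax(h_last @ W_cls)[label]
    return task_nll + lam * lm_nll

# Toy numbers: an uninformative classifier (all-zero weights) over 2 classes.
loss = fine_tune_loss(np.zeros(3), np.zeros((3, 2)), label=1, lm_nll=2.0)
```

Setting `lam=0` reduces this to plain supervised fine-tuning, which makes it easy to compare the two variants.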

In addition to the features above, ‘scale’ was another attribute of GPT-1. It was trained on the large BooksCorpus dataset with 240 GPU days of compute. The models that followed GPT-1 were trained on large volumes of data, with powerful GPUs/TPUs and with more and more parameters. GPT-1 successfully demonstrated that a transformer combining massive pre-training with a little supervised fine-tuning and an auxiliary learning objective can serve various NLP tasks (NLI, question answering and classification). In fact, it outperformed various state-of-the-art models of the time.

The GPT-1 model, however, is uni-directional (left to right) in nature, as self-attention is based only on previous tokens. This is something that BERT addresses.
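The left-to-right restriction is usually implemented as a mask on the attention scores. The sketch below assumes that approach: scores for future positions are set to minus infinity before the softmax, so they end up with zero weight. The function name is made up, and the all-zero scores are chosen only so that the mask is the one thing shaping the result.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row i may only attend to columns 0..i; future positions get zero weight."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # the diagonal is never masked, so the max is finite
    e = np.exp(masked)                                    # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # equal raw scores for every pair of positions
weights = causal_attention_weights(scores)
```

The first token can only attend to itself, while the last token spreads its attention over the whole prefix. A bidirectional encoder simply omits this mask.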


Long journey indeed! BERT (Bidirectional Encoder Representations from Transformers) was published by Devlin et al. at Google shortly after GPT-1. Overall, the approach looks very similar to GPT-1: unsupervised language model pre-training followed by a supervised fine-tuning step. However, BERT’s architecture is based on a multi-layer bidirectional transformer encoder, as in the original transformer of Vaswani et al., whereas the GPT-1 architecture is a left-context-only (unidirectional) version of the original architecture, commonly referred to as a ‘transformer decoder’.


The main argument of the authors was that unidirectional pre-training limits the representations available to downstream tasks and is hence sub-optimal. For example, a unidirectionally pre-trained model fine-tuned for question answering is sub-optimal because context from both directions is not exploited. Since BERT is bi-directional, a standard language modelling task is unfit as a learning objective. This is because in the transformer architecture all words are fed to the model at once (and are hence accessible to it). With a standard LM task, the word to be predicted would be indirectly visible to the model through the other direction, and learning would become trivial.

Source: Princeton COS 484

BERT addressed this by using ‘Masked Language Modelling’ (MLM), which essentially masks random tokens in the text and predicts them.

Source: Princeton COS 484
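A rough sketch of the masking step is shown below. The details go beyond the one-line description above and follow the BERT paper: about 15% of tokens are selected for prediction, and of those 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged. The sketch simplifies things by deciding independently per token and by not protecting special tokens; `mask_tokens` and the toy vocabulary are made up for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)        # token not selected: keep it, no prediction target
            labels.append(None)
            continue
        labels.append(tok)               # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave the token unchanged
    return corrupted, labels

tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under"]          # toy vocabulary for random replacements
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, rng=random.Random(0))
```

The loss is computed only at positions where the label is not `None`, so the model never gets credit for copying tokens it can see.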

In addition to MLM, BERT also employs another learning objective called ‘Next Sentence Prediction’ (NSP). In NSP, the objective is to classify whether one sentence actually follows another given sentence. Intuitively, this helps the model learn relations between sentences.

Source: Princeton COS 484
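Building NSP training examples could look roughly like this. The 50/50 split between true and random successors follows the BERT paper. As a simplification, the negative sentence here is drawn from the same document rather than from elsewhere in the corpus, and the sentences are assumed to be distinct; `make_nsp_pairs` is a made-up name.

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples from one ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative example")
    rng = rng or random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really is the sentence that follows A.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except A itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

doc = ["s0", "s1", "s2", "s3"]           # placeholder sentences
pairs = make_nsp_pairs(doc, rng=random.Random(0))
```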

Fine-tuning is the second phase in BERT, just as in GPT-1. The modifications (input representation and output layers) are essentially task-specific. For example, for a classification task, the final representation of [CLS] (the special first token of every input sequence) is fed to a classifier network. The learning is end to end, which means that all layers and their weights continue to learn.

Source: Original Paper
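The input representation can be illustrated with a small helper. The `[SEP]` separator and the segment ids are not described above but come from the BERT paper; the helper name `build_bert_input` is made up, and real implementations also map the tokens to vocabulary ids and add padding.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT's special tokens and mark their segments."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second segment and its closing [SEP]
    return tokens, segment_ids

single = build_bert_input(["great", "movie"])
pair = build_bert_input(["a", "b"], ["c"])
```

Because `[CLS]` always sits at position 0, a classifier can simply read the encoder's output at that position, whether the input is a single sentence or a sentence pair.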

BERT was proposed in two flavors, BERT base and BERT large. They primarily differ in the number of layers (transformer blocks): base has 12 layers and 110M parameters, large has 24 layers and 340M parameters. BERT is indeed a milestone in natural language processing: it successfully demonstrated a state-of-the-art approach to transfer learning based on transformers (self-attention), bi-directionality and a clever learning objective. And, of course, it was trained on a large-scale corpus (BooksCorpus + English Wikipedia, with 256 TPU days).
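The parameter counts can be roughly reproduced with some back-of-the-envelope arithmetic. The hidden sizes (768 and 1024) and feed-forward sizes (4 × hidden) are taken from the BERT paper. The vocabulary size of 30,522, the 512 positions, the 2 segment types and the extra pooling layer on top of [CLS] are assumptions based on the commonly released uncased checkpoints, so treat the result as an estimate.

```python
def bert_param_count(layers, hidden, ff, vocab=30522, max_pos=512, segments=2):
    """Approximate number of weights in a BERT-style encoder."""
    layer_norm = 2 * hidden                                    # scale and shift vectors
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)                 # Q, K, V and output projections with biases
    feed_forward = (hidden * ff + ff) + (ff * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm      # two LayerNorms per block
    pooler = hidden * hidden + hidden                          # dense layer on top of [CLS]
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768, ff=3072)
large = bert_param_count(layers=24, hidden=1024, ff=4096)
print(f"base: {base / 1e6:.1f}M, large: {large / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the quoted 110M and 340M. It also shows that the transformer blocks, not the embeddings, account for most of the weights.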

Beyond BERT
There have already been many advances since the original BERT paper, in various directions. There are more sophisticated variations such as RoBERTa, which is trained for longer, on a larger corpus, and with refined learning objectives (such as dynamic masking and dropping NSP). Another variant, ALBERT, aims to produce a smaller model by using parameter-reduction techniques. ELECTRA and XLNet are a few other interesting variations.

There has also been active research on making the BERT model lightweight (BERT large has ~340M parameters). Several approaches have been proposed for this, such as weight pruning, quantization and distillation (DistilBERT). Here is an excellent blog on the same:

There has been rapid and tremendous growth within the NLP world: from statistical representations of text towards context-aware neural representations; from statistics and classical ML-based approaches to deep-learning-based sequence models; discovering attention and bi-directionality on the way and realizing the power of transfer learning; and finally towards the sophisticated transformer architecture. Modern NLP frameworks have come a long way by building on these important milestones and on scale.



