Evolution of Natural Language Processing (NLP)
In this article I want to share the evolution of text analysis algorithms over the last decade.
Natural language processing (NLP) has been around for a long time. In fact, a very simple bag-of-words model was introduced in the 1950s.
But in this article I want to focus on the evolution of NLP in recent times.
There has been enormous progress in the field since 2013, driven by advances in machine learning algorithms together with the reduced cost of computation and memory.
In 2013, a research team led by Tomas Mikolov at Google introduced the Word2Vec algorithm.
Word2Vec converts text into vectors, also called embeddings. In the commonly used pretrained model, each vector consists of 300 values, so it is called a 300-dimensional vector: a point in a 300-dimensional vector space.
You can then use those vector representations as inputs to your machine learning models.
Using these vectors, we can apply algorithms such as k-nearest neighbors classification or clustering.
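As a minimal sketch of how such vectors feed a nearest-neighbor search, the snippet below ranks words by cosine similarity. The three-dimensional vectors and the names `EMBEDDINGS` and `nearest_neighbors` are made up for illustration; real Word2Vec vectors are learned from data and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
EMBEDDINGS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
    "pear":  [0.12, 0.18, 0.85],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(word, k=1):
    """Return the k words whose vectors are most similar to `word`."""
    query = EMBEDDINGS[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


print(nearest_neighbors("king"))
print(nearest_neighbors("apple"))
```

With these toy vectors, "king" lands closest to "queen" and "apple" closest to "pear", which is the kind of neighborhood structure a classifier or clustering algorithm can exploit.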
Word2Vec is known for its two model architectures:
- continuous bag of words (CBOW)
- continuous skip-gram
Both architectures are based on shallow, two-layer neural networks.
CBOW predicts the current word from a window of surrounding context words, whereas continuous skip-gram uses the current word to predict the surrounding window of context words.
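The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only builds those examples and does not train a network; the function names and the window size are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) examples: the context predicts the word."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Build (current_word, context_word) examples: the word predicts its context."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        for context_word in context:
            pairs.append((target, context_word))
    return pairs


tokens = "the quick brown fox".split()
print(cbow_pairs(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

CBOW yields one example per position with the whole window as input, while skip-gram yields one example per (word, neighbor) pair.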
One challenge with Word2Vec, though, is that it tends to run into what are called out-of-vocabulary issues, because its vocabulary contains only three million words. When a word is not in the vocabulary, the model assigns it a zero vector, which basically discards the word.
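A small sketch of that fallback, assuming a plain dictionary as the vocabulary (the `lookup` helper and the two-dimensional vectors are hypothetical, not part of any Word2Vec library):

```python
def lookup(word, vocab, dim):
    """Return the stored vector, or an all-zero vector for an unknown word."""
    return vocab.get(word, [0.0] * dim)


# Hypothetical tiny vocabulary; a real one would hold millions of entries.
vocab = {"farm": [0.2, 0.7], "field": [0.3, 0.6]}

print(lookup("farm", vocab, dim=2))     # known word keeps its learned vector
print(lookup("farming", vocab, dim=2))  # unknown word collapses to zeros
```

Every unknown word maps to the same zero vector, so downstream models cannot tell such words apart.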
In 2014, a research team led by Jeffrey Pennington at Stanford University introduced GloVe, or Global Vectors.
GloVe's novel approach uses a regression model to learn word representations through unsupervised learning.
The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.
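To make the ratio idea concrete, here is a rough sketch that counts co-occurrences in a tiny hand-written corpus and compares how strongly a probe word goes with "ice" versus "steam". The corpus, the whole-sentence window, and the function names are assumptions for illustration; GloVe itself is trained on very large corpora with a weighted window.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; each sentence is treated as one co-occurrence window.
CORPUS = [
    "ice is solid",
    "ice is cold solid",
    "ice is water",
    "ice is not gas",
    "steam is gas",
    "steam is hot gas",
    "steam is water",
    "steam is not solid",
]


def cooccurrence_counts(sentences):
    """Count how often each word appears in the same sentence as another word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    counts[word][other] += 1
    return counts


def probability(counts, word, context):
    """Estimate P(context | word) from the co-occurrence counts."""
    total = sum(counts[word].values())
    return counts[word][context] / total


counts = cooccurrence_counts(CORPUS)
for probe in ("solid", "gas", "water"):
    ratio = probability(counts, "ice", probe) / probability(counts, "steam", probe)
    print(probe, round(ratio, 2))
```

In this toy corpus the ratio is above 1 for "solid", below 1 for "gas", and equal to 1 for "water", which relates to both words equally. That pattern is the signal the observation above refers to.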
Like Word2Vec, GloVe still has the out-of-vocabulary problem.
In 2016, Facebook’s AI Research (FAIR) lab published their work on FastText.
They describe it as a library for efficient text classification and representation learning.
FastText builds on Word2Vec, but it treats each word as a set of subwords called character n-grams. This helps with the out-of-vocabulary issue found in Word2Vec and GloVe.
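For example, using n-grams of 3 characters and the boundary markers “<” and “>” that FastText adds around each word, the word “farming” is divided as
“<farming>” => “<fa” , “far” , “arm” , “rmi” , “min” , “ing” , “ng>”
FastText repeats this for every n-gram length it is configured to use, so longer pieces such as “farm” are included as well.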
Even if the word “farming” is not in the vocabulary, chances are that “farm” is. The embedding that FastText learns for a word is the aggregate of the embeddings of its n-grams. FastText uses the same CBOW and skip-gram models, and it increases the effective vocabulary of Word2Vec beyond the three million words.
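The subword idea can be sketched in a few lines. The function below extracts character n-grams with the “<” and “>” boundary markers and lengths 3 to 6, which match FastText's documented defaults. The `word_vector` helper, the plain dictionary of n-gram vectors, and the choice to sum them are simplifying assumptions; the real library hashes n-grams into buckets and learns the vectors during training.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the n-grams we know; unknown n-grams add nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("farming", min_n=3, max_n=3))

# Hypothetical learned n-gram vectors; "farming" itself was never seen.
ngram_vectors = {"farm": [1.0, 2.0], "ing>": [0.5, 0.5]}
print(word_vector("farming", ngram_vectors, dim=2))
```

Because “farm” and “ing>” are both n-grams of “farming”, the unseen word still gets a non-zero vector built from pieces the model already knows.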
Another large milestone in the evolution of text analysis was the introduction of the Transformer architecture in 2017, in a paper called “Attention Is All You Need”.
The paper introduced a novel neural network architecture based on a self-attention mechanism. (In earlier sequence-to-sequence RNNs, attention meant that a weighted sum of the hidden states was passed as the context vector to a future time step, usually in the decoder.)
The concept of attention had been studied before for different model architectures and generally refers to one model component capturing the correlation between inputs and outputs. In NLP terms, attention maps each word of the model’s output to the words in the input sequence, assigning them weights depending on their importance to the predicted word.
The self-attention mechanism in the transformer architecture focuses on capturing the relationships between all words in the input sequence, thereby significantly improving the accuracy of natural language understanding tasks such as machine translation.
While the transformer architecture marked a very important milestone for NLP, other research teams kept building alternative architectures with it as the root.
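A stripped-down sketch of self-attention may help here. It scores every word against every other word with a scaled dot product, turns the scores into weights with a softmax, and returns weighted sums. As a simplifying assumption, the input vectors serve directly as queries, keys, and values; the actual transformer first applies learned projection matrices and uses multiple heads. The function names and the toy inputs are made up.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        output = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical word vectors; the first and third are identical.
outputs, weights = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one word attends to every word in the sequence, including itself. Here the first word puts more weight on itself and on the identical third word than on the second.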
In 2017, AWS introduced BlazingText, an AWS-based algorithm. According to AWS, BlazingText provides highly optimized implementations of the Word2Vec and text classification algorithms. BlazingText scales and accelerates Word2Vec using multiple CPUs or GPUs for training. Similarly, the BlazingText implementation of the text classification algorithm extends FastText to use GPU acceleration with custom CUDA kernels. CUDA, or Compute Unified Device Architecture, is a parallel computing platform and programming model developed by Nvidia.
Using BlazingText, you can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU. BlazingText creates character n-gram embeddings using the continuous bag of words and skip-gram training architectures. BlazingText also lets you stop model training early based on an event, for example when validation accuracy stops increasing. BlazingText also optimizes I/O for datasets stored in Amazon Simple Storage Service (Amazon S3).
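The early-stopping behavior can be illustrated with a generic sketch. This is not the BlazingText API; the function name, the `patience` parameter, and the list of accuracies standing in for per-epoch validation results are all assumptions for illustration.

```python
def train_with_early_stopping(validation_accuracies, patience=2):
    """Stop once accuracy has failed to improve for `patience` epochs in a row.

    Returns (epochs_run, best_accuracy). The list of accuracies stands in
    for evaluating a real model after each training epoch.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(validation_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best
    return len(validation_accuracies), best


# Hypothetical validation accuracies for six epochs.
history = [0.70, 0.75, 0.78, 0.78, 0.77, 0.80]
print(train_with_early_stopping(history, patience=2))
```

With a patience of 2, training stops after the fifth epoch because epochs four and five did not beat the best accuracy seen so far, so the later improvement is never reached. Choosing the patience value is a trade-off between training cost and the risk of stopping too soon.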
Embeddings from Language Models (ELMo), an NLP framework developed by AllenNLP, 2018.
They call it “deep contextualized word representations”.
ELMo is a novel way to represent words as vectors or embeddings. These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.
Because ELMo combines a forward and a backward language model, it is able to better capture syntax and semantics across different linguistic contexts.
Generative Pre-Training (GPT), by OpenAI, 2018, is a series of semi-supervised learning models.
GPT is based on the transformer architecture but performs two training steps:
- GPT learns a language model from a large unlabeled text corpus.
- GPT performs a supervised learning step with labeled data to learn a specific NLP task such as text classification.
GPT is unidirectional: when predicting a word, it looks only at the words to its left.
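One way to picture "unidirectional" is as an attention mask. The sketch below builds a mask in which each position may look only at itself and earlier positions, and contrasts it with a mask that allows every position to see every other. The function name and the boolean-matrix representation are assumptions for illustration.

```python
def attention_mask(length, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(length)]
        for i in range(length)
    ]


# Unidirectional (left-to-right): each row sees only itself and the past.
for row in attention_mask(3, causal=True):
    print(row)

# Bidirectional: every position sees the full sequence.
for row in attention_mask(3, causal=False):
    print(row)
```

The lower-triangular pattern in the causal mask is what keeps a left-to-right language model from peeking at the words it is supposed to predict.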
Bidirectional Encoder Representations from Transformers (BERT), by Google AI Language, 2018.
At its release, BERT was a state-of-the-art language model for NLP.
BERT takes into account the context of each occurrence of a given word.
Consider two sentences:
“He is running a company” and “He is running a marathon”
For the word “running”, BERT provides a contextualized embedding that differs between the two sentences.
BERT is truly bidirectional.
In the unsupervised training step, BERT learns representations from unlabeled text using both the left-to-right and the right-to-left context.
The original English-language BERT has two models:
(1) BERT_BASE: 12 encoders with 12 bidirectional self-attention heads, and
(2) BERT_LARGE: 24 encoders with 16 bidirectional self-attention heads.
Both models are pre-trained on unlabeled data extracted from the BooksCorpus (800M words) and English Wikipedia (2,500M words).
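The two configurations can be written down as a small data structure. The layer and head counts come from the text above. The hidden sizes (768 and 1024) are not stated in this article and are taken from the BERT paper. The `BertSize` class is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    """Hypothetical container for the headline BERT hyperparameters."""
    name: str
    encoder_layers: int
    attention_heads: int
    hidden_size: int  # from the BERT paper, not from this article

    @property
    def head_dim(self):
        """Width of each attention head: hidden size split evenly across heads."""
        return self.hidden_size // self.attention_heads


BERT_BASE = BertSize("BERT_BASE", encoder_layers=12, attention_heads=12, hidden_size=768)
BERT_LARGE = BertSize("BERT_LARGE", encoder_layers=24, attention_heads=16, hidden_size=1024)

for model in (BERT_BASE, BERT_LARGE):
    print(model.name, model.encoder_layers, model.attention_heads, model.head_dim)
```

Both sizes end up with the same per-head width, so the larger model gains capacity from more layers and more heads instead of wider heads.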
This novel approach created interest in BERT across the industry and has led to many variations of BERT models, some of which are language specific or domain specific. BERT itself is the most popular model of its kind.