
Social Media(SMS) text to Formal English text translation — Grammatical Error correction using Deep Learning

Table of contents:

1. Introduction

2. Problem Statement

3. Source of Data

4. ML/DL Formulation

5. Performance Metrics

6. Similar Approaches to the Problem

7. Machine Translation Types

8. Attention Model

9. Exploratory Data Analysis

10. First Cut Approach

11. Model Explanation

12. Comparison of Models

13. Working Demo

14. Future Work

15. References

Example of Translation


Introduction:

Social media is an inescapable technology that helps us reach out, connect, find information, and share thoughts and ideas. In the current generation, it helps us get our work done in quicker and more efficient ways. Several platforms dominate worldwide, namely WhatsApp, Facebook, Twitter, Instagram, and LinkedIn. These platforms play a vital role in both formal and informal connections, conversations, and information sharing.

Problem Statement:

Over years of using various social media platforms, text conversations have grown terse: people shrink their messages as much as possible. Having said that, are those conversations really sound in terms of grammar, spelling, formality, and sentence formation? Not really, because users abbreviate to save time and space. What if one user does not understand the colloquial words the other uses? What if the informal or colloquial text needs to be translated into formal text? That is where machine translation comes into play, and here we dive deep into how to translate informal SMS text into formal text.

Source of Data:

For this problem statement, the data was collected from . The data set contains pairs of corresponding, meaningful Italian and English sentences.

Overview of the data set: it has about 350,360 sentences.

Example of Data set

ML/DL Formulation:

The task of converting a sentence with informal context to one with formal context takes a sequence of words as input and produces a sequence of words as output. The crucial part is capturing the correct meaning while converting from one form to the other. It is therefore a sequence-to-sequence task, and it can be measured with the BLEU score, which compares the words that the target and the prediction have in common. Being a sequence-to-sequence task, state-of-the-art Natural Language Processing (NLP) models such as Encoder-Decoder, Attention models, and Transformers can be trained to perform it.

Performance Metrics:

Since the problem is sequence to sequence, there are multiple output tokens, so Softmax was used as the final layer to predict each one. By its nature, Softmax makes the output probabilities sum to 1, while the targets fed into the model during training are the actual token indices of the words in the sentence. So a custom loss function was introduced in this problem to monitor the validation loss during the model call.

Custom loss function
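The article's custom loss function appears only as a screenshot. A minimal sketch of the usual idea, a cross-entropy that ignores padding positions, written in plain Python and assuming padding token 0, could look like this:

```python
import math

def masked_loss(y_true, y_pred_probs, pad_token=0):
    """Average cross-entropy over non-padding positions only.

    y_true: list of target token indices.
    y_pred_probs: list of per-position probability distributions (Softmax output).
    """
    total, count = 0.0, 0
    for token, probs in zip(y_true, y_pred_probs):
        if token == pad_token:
            continue  # padding positions must not contribute to the loss
        total += -math.log(probs[token])
        count += 1
    return total / max(count, 1)
```

In a Keras training loop the same masking is normally done tensor-wise, but the principle is identical: only real words count toward the loss.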

BLEU score: it compares the n-grams of words in the target and predicted text, irrespective of the order in which they appear.

Actual — “Hello How are you”
Predicted — “Hello How are you”
BLEU score = 1

Actual — “Hello Who are you”
Predicted — “Hello How are you”
BLEU score = 0.75

In the ideal case, BLEU score = 1.
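The numbers above can be reproduced with a simple clipped unigram precision. (Full BLEU also combines higher-order n-grams and a brevity penalty; this sketch covers only the unigram part the examples illustrate.)

```python
def unigram_precision(actual, predicted):
    """Fraction of predicted words that also occur in the actual sentence,
    with each actual word usable at most as many times as it appears."""
    ref_counts = {}
    for w in actual.split():
        ref_counts[w] = ref_counts.get(w, 0) + 1
    matches = 0
    for w in predicted.split():
        if ref_counts.get(w, 0) > 0:
            matches += 1
            ref_counts[w] -= 1
    return matches / len(predicted.split())

print(unigram_precision("Hello Who are you", "Hello How are you"))  # 0.75
```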

Similar Approaches to the Problem:


Word Sense Disambiguation (WSD) systems convert or translate informal context into formal, sensible context. The paper mentioned above explains this translation from informal to formal context. It is not just about expanding short forms/abbreviations: it is about capturing the contextual meaning and transforming a given informal sentence into a formal one that preserves that meaning.


This reference gives an insight into what SMS text normalization is all about: converting informal text conversations into formal English sentences. The paper follows two approaches, a dictionary-substitution approach and Statistical Machine Translation (SMT). Both were applied and their results compared on three different kinds of data set, to test whether the models actually perform well even when tested beyond their training domain.

This paper gives an important idea of how the translation can be done effectively. From the paper's point of view, the data set must be large enough: training on a large data set gives the model effective results. Another point is capturing the meaning of the sentence with respect to its context. Dictionary substitution only expands short forms irrespective of the contextual meaning, so it can fail to match some short SMS forms to the meaning they are supposed to carry.

Example: “2 u 2” -> “to you too” / “too you too” / “to you two” / “two you two”.

As the example above shows, the model can expand the SMS into any of these forms, but the expansion must be relevant to the context. The model must therefore learn the entire sequence and translate it, rather than translating the same words identically in every context; otherwise the translated sentences end up meaningless.
Statistical Machine Translation was followed to prevent these errors. In the paper, the Statistical Machine Translation system that was built outperforms the other systems at translation, also called text normalization. The only concern was that even more training data is needed for the model to perform better across a wider range.
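The ambiguity of context-free dictionary substitution is easy to demonstrate. With a tiny hypothetical SMS dictionary (the papers' actual dictionaries are much larger), every combination of expansions is equally possible:

```python
from itertools import product

# Tiny hypothetical SMS dictionary: each short form maps to its candidate expansions
SMS_DICT = {"2": ["to", "too", "two"], "u": ["you"]}

def expansions(sms):
    """Enumerate every expansion a context-free dictionary substitution allows."""
    options = [SMS_DICT.get(tok, [tok]) for tok in sms.split()]
    return [" ".join(combo) for combo in product(*options)]

print(expansions("2 u 2"))  # 9 candidates; the dictionary alone cannot pick the right one
```

Only a model that reads the whole sequence can choose “to you too” over the other eight candidates.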

Machine Translation Types:

Translation approaches other than the basic Encoder-Decoder also share the same processing steps: token computation, part-of-speech tagging, vector computation, and so on.

There are different forms of translation from a source language to a target language, and they have kept improving over the generations.

There are three broad approaches of machine translation:

  • Rule-Based Machine Translation (RBMT)
  • Statistical Machine Translation (SMT)
  • Neural Machine Translation (NMT)

The variations among these approaches make one better than another. Rule-Based Machine Translation (RBMT) models rely on expert knowledge of both the source and target languages. They require dictionaries to perform good translation, and they encode the grammar of both languages: one rule set for the source language and one for the target language. All of this is needed to frame the semantic and syntactic rules that achieve the translation.

Similarly, Statistical Machine Translation (SMT) also requires dictionary translation. At its foundation, this approach uses Bayes' theorem to translate the source language into the target language. It focuses mainly on phrase translation, that is, on words and their direct meaning in the target language, which makes it difficult to produce sensible translations of lengthy sentences.
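The Bayes-theorem foundation mentioned above is usually stated as the noisy-channel objective: the best target sentence e* for a source sentence f is

```latex
e^{*} = \arg\max_{e} P(e \mid f)
      = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
      = \arg\max_{e} P(f \mid e)\, P(e)
```

where P(e) is a language model of the target and P(f | e) is the translation model; P(f) is constant across candidates and drops out of the maximization.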

On the other hand, the Encoder-Decoder approach makes things simpler by building end-to-end models with neural networks. It requires no domain knowledge or linguists to perform the translation, unlike the other approaches. The neural network takes charge of the tokens passed along the time axis, and the model learns the appropriate target through backpropagation. Things stay simple because the model itself learns during training, fed with both the source and target languages via proper token computation.

Credits: Applied AI Course (Applied Roots)

Basic Encoder-Decoder Model Geometric Architectural Representation

Thus the basic Encoder-Decoder model differs from the other translation models, though it too encodes and decodes the input and produces the required output.
x1, x2, …, xt are the input words of a sentence, fed to the Encoder.
w is the output of the final Encoder cell; y1, y2, … are the outputs of the Decoder.

Attention Model:

Although the Encoder-Decoder model performs sequence-to-sequence tasks, problems arise when the input and output sequences become lengthy.

As we know, the Encoder-Decoder model compresses the whole essence of the sentence into the last Encoder state and sends it to the Decoder, so it performs poorly on longer sentences. Depending on the context, the meaning of a word may hinge on words at the very beginning or the very end of the sentence. Attention models take this contextual meaning into account and pay attention to the entire sentence before translating it into the relevant output.

Credits: Applied AI Course (Applied Roots)

Attention Model Geometric Architectural Representation

Unlike the Encoder-Decoder model, the first Decoder cell of an Attention model receives the outputs of all Encoder cells as a weighted combination. This lets the model pay attention to the entire sentence, with the weights learned as trainable parameters during training, and helps the Attention model translate lengthier sentences with good accuracy.

Exploratory Data Analysis:

We took the data from the relevant source; the data set contains almost 350,360 pairs of English and French sentences. For our task we need the English data, which can be obtained from this data set. The sentences vary in length, so for ease of handling, only sentences longer than 5 words and at most 20 words long are kept. Finally, to avoid repetition, duplicates are dropped from the data set.
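The length filter and duplicate drop described above can be sketched in plain Python (the article's own code is shown only as screenshots, so names here are illustrative):

```python
def clean_sentences(sentences, min_words=6, max_words=20):
    """Keep sentences of 6..20 words (i.e. length > 5 and <= 20)
    and drop duplicates while preserving the original order."""
    seen, kept = set(), []
    for s in sentences:
        n = len(s.split())
        if min_words <= n <= max_words and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept
```

In pandas the same effect comes from a boolean mask on the word count followed by `drop_duplicates()`.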

KDE Plot with Sentence Lengths
Optimize Lengths
Drop Duplicates

Finally, a new data set of almost 70,000 data points was created by converting the formal English words into informal ones. This can now be used to train the model.

Function to Generate Informal Sentences
Generation of Informal SMS Sentences
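The informal-sentence generator is shown only as a screenshot; a minimal sketch of the idea, substituting formal words with SMS forms from a small assumed mapping (the article's actual mapping is not shown), could look like this:

```python
import random

# Small assumed formal-to-SMS mapping; the article's real table is larger
FORMAL_TO_SMS = {"you": "u", "are": "r", "to": "2", "for": "4", "be": "b",
                 "great": "gr8", "tomorrow": "tmrw", "please": "plz",
                 "morning": "mornng"}

def to_informal(sentence, p=1.0, seed=None):
    """Replace known formal words with their SMS forms, each with probability p,
    to synthesize (informal, formal) training pairs from formal sentences."""
    rng = random.Random(seed)
    words = []
    for w in sentence.lower().split():
        if w in FORMAL_TO_SMS and rng.random() <= p:
            words.append(FORMAL_TO_SMS[w])
        else:
            words.append(w)
    return " ".join(words)

print(to_informal("Would you wake me up every morning"))
# -> "would u wake me up every mornng"
```

Setting p below 1 leaves some formal words untouched, which adds useful variety to the synthetic pairs.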

First Cut Approach:

· Collecting / Downloading Data from Relevant Resource

· Pre-Processing Data

· Building Deep Learning Model with Hyper Parameter Tuning

· Fitting Data upon Appropriate Model with Best Parameters

Approach 1:

The Encoder-Decoder model should be analyzed on the test data set, with the performance metrics observed for all model variants.

Several hyperparameters are to be tried out, such as:
changing the number of layers, activation functions, optimizers, dropout, and normalization in necessary and possible places, to observe performance variations and bring out the best of each model.

Approach 2:

The Attention model should be analyzed on the test data set, with the performance metrics observed for all model variants.

Similar to the Encoder-Decoder, several parameter variations are to be tried out.

Both the Attention and Encoder-Decoder models are to be analyzed and their performance and results compared. Several parameter changes are made in this experimentation to find the best parameters; extensive experimentation along these lines yields a well-performing model.

Model Explanation:

Encoder-Decoder Model :

Encoder-Decoder Code

Attention Model:

Attention Code
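The Attention code is likewise shown only as a screenshot. The core operation, scoring every Encoder output against the current Decoder state and taking a weighted sum, can be sketched in numpy; dot-product (Luong-style) scoring is assumed here as one common choice:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention(decoder_state, encoder_outputs):
    """decoder_state: (units,); encoder_outputs: (timesteps, units).
    Returns the context vector and the attention weights."""
    scores = encoder_outputs @ decoder_state  # one alignment score per input word
    weights = softmax(scores)                 # distribution over input positions
    context = weights @ encoder_outputs       # weighted sum of encoder outputs
    return context, weights
```

The weights form a probability distribution over the input words, so the Decoder can focus on whichever part of the sentence matters for the next output token.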

Comparison of Models:

From the analysis, the Encoder-Decoder model underperformed the Attention model since, as discussed earlier, Attention models are more effective at capturing the meaning of a sentence.

Encoder-Decoder Model:
Input : ‘would u wake me up every mornng’
Output : ‘would met would met would met would met would met would met’

Attention Model:
Input : ‘would u wake me up every mornng’
Output : ‘would you wake me up every morning’

Working Demo:

Future Work:

Trying other attention functions, such as Bahdanau attention.



Notebook link:

LinkedIn profile:

