2. Business Problem
The business problem is to detect at least 30% of the grammatical errors in a text and correct them within a reasonable turnaround time and with optimum CPU utilization. In a low-resource setting, a GEC system can serve as a word processor, a post-editor, and a learning aid for learners of the language.
3. Mapping to Machine Learning Problem
The above business problem can be solved using statistical models, rule-based models, or neural machine translation models. In this case study, we have experimented with a neural machine translation approach.
The loss function we use here is sparse categorical cross-entropy, which computes the cross-entropy between the integer-encoded target labels and the predicted probability distribution over the vocabulary.
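To make the loss concrete, here is a minimal NumPy sketch of sparse categorical cross-entropy (not the actual TensorFlow call the model uses): for each example, take the probability the model assigns to the true class and average the negative log-likelihoods over the batch.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """labels: (batch,) integer class ids; probs: (batch, num_classes) probabilities."""
    eps = 1e-12  # guard against log(0)
    # Probability each example assigns to its true class.
    true_class_probs = probs[np.arange(len(labels)), labels]
    # Average negative log-likelihood over the batch.
    return float(-np.mean(np.log(true_class_probs + eps)))

probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1]])
loss = sparse_categorical_crossentropy(np.array([0, 1]), probs)
```

Note that the "sparse" variant takes integer labels directly, so the targets never need to be one-hot encoded, which matters when the vocabulary is large.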
4. Understanding the Data
We collected the dataset from IndicCorp, one of the largest publicly available corpora for Indian languages.
We downloaded the dataset using the curl/wget command-line tools.
We examined the first few lines of the extracted text file using the head command.
5. Data Pre-Processing
In data pre-processing, we need to construct a pair of real and inflected sentences for each record in the dataset. For simplicity, we treat the input text from the dataset as the original, correct (real) sentence. From this, we need to construct the erroneous sentences. To do so, we tokenize the entire dataset and run each token through a POS tagger to obtain its part-of-speech tag.
Using regex and a dictionary in Python, we generated inflections for each sentence according to its parts of speech and stored them in separate pickle files.
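The regex-and-dictionary step can be sketched as below. This is a toy illustration with made-up English rules; the actual rules in the project are language-specific and the `INFLECTION_RULES` table here is purely hypothetical.

```python
import re

# Toy dictionary mapping a POS tag to substitution rules (pattern -> wrong form).
# These English rules are illustrative only; the real rules target the
# inflections of the language being corrected.
INFLECTION_RULES = {
    "VERB": [(r"\bis\b", "are"), (r"\bhas\b", "have")],
    "NOUN": [(r"\bchildren\b", "childs")],
}

def inflect(sentence, pos_tags):
    """Corrupt a correct sentence by applying one rule per matched POS tag."""
    corrupted = sentence
    for tag in pos_tags:
        for pattern, wrong in INFLECTION_RULES.get(tag, []):
            corrupted = re.sub(pattern, wrong, corrupted, count=1)
    return corrupted

# A (real, inflected) training pair as used in the parallel corpus.
pair = ("She is happy", inflect("She is happy", ["PRON", "VERB", "ADJ"]))
```

Each original sentence paired with its corrupted copy forms one record of the parallel corpus the translation model trains on.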
An alternative approach to generating inflections
We installed the iNLTK library and, using joblib's Parallel processing, generated three inflected variants of each original sentence. This is similar to data augmentation in image processing. Of the sentences generated this way, some were inflected but others remained unchanged, and the method did not cover all the grammatical error types we needed. Hence this approach failed to generate adequate inflections. In later sections, we discuss how a model trained on this data failed.
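The fan-out pattern of that attempt looks roughly like the sketch below. To keep it self-contained, a placeholder function stands in for the iNLTK variant-generation call, and the standard library's thread pool stands in for joblib's `Parallel` (both names and the substitution are assumptions of this sketch, not the original code).

```python
from concurrent.futures import ThreadPoolExecutor

def make_variants(sentence, n=3):
    """Placeholder for a library call that returns n similar/inflected sentences.
    Here each copy is just tagged so the sketch stays runnable on its own."""
    return [f"{sentence} <variant-{i}>" for i in range(n)]

def augment(sentences):
    # Fan per-sentence work out across workers, as joblib's Parallel did in
    # the original pipeline (threads are used here to keep the sketch portable).
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(make_variants, sentences))

variants = augment(["sentence one", "sentence two"])
```

The parallelism only speeds up generation; it does nothing about the core problem noted above, that many "variants" came back identical to the input.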
The reader of this article should be familiar with the basics of Python programming, machine learning, deep learning, TensorFlow, and building data-driven applications.
We experimented with many architectures on a 30k-sentence dataset in Google Colaboratory, using a CPU runtime for dataset collection and pre-processing and a GPU runtime for model training and prediction. With those architectures, grammatical error handling was not at all satisfactory.
Finally, we chose Bahdanau's additive attention, as detailed in the TensorFlow blog for machine translation.
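Bahdanau's additive attention scores each encoder output against the current decoder state with a small feed-forward network, score_i = v · tanh(W1·h_i + W2·s), then softmaxes the scores and takes a weighted sum as the context vector. A minimal NumPy sketch (randomly initialised weights standing in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
units, enc_dim, dec_dim, seq_len = 8, 16, 16, 5

# Learned parameters (random here): W1/W2 project encoder outputs and the
# decoder state into a shared space; v reduces each position to a scalar score.
W1 = rng.normal(size=(enc_dim, units))
W2 = rng.normal(size=(dec_dim, units))
v = rng.normal(size=(units,))

def bahdanau_attention(enc_outputs, dec_state):
    """Additive attention: score_i = v . tanh(W1 h_i + W2 s)."""
    scores = np.tanh(enc_outputs @ W1 + dec_state @ W2) @ v  # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax
    context = weights @ enc_outputs                           # (enc_dim,)
    return context, weights

enc_outputs = rng.normal(size=(seq_len, enc_dim))
dec_state = rng.normal(size=(dec_dim,))
context, weights = bahdanau_attention(enc_outputs, dec_state)
```

In the actual model these projections are trainable `Dense` layers and the context vector is concatenated with the decoder input at every step, as in the TensorFlow NMT tutorial.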
We used BPEmb in the encoder layer to obtain sub-word embeddings. We trained three models separately:
- Model1: Sentences with at most 6 words. The Adam optimizer's parameters were left at their defaults. Without any regularization, the model appeared to overfit.
- Model2: Sentences with at most 6 words. The Adam learning rate was set to 5e-5, with an l2 regularizer on the encoder dense layer. Predictions on the validation dataset were average.
- Model3: Sentences with more than 6 words. The Adam learning rate was set to 5e-5, with an l2 regularizer on the encoder dense layer. Predictions were good on the validation dataset as well.
We trained the model and found that the loss on a single, quickly drawn batch of input reduced to zero within 50 epochs. Therefore this is an overfit model.
We also made predictions for a few random records in the training dataset. The errors were handled, but the words that should have carried over unchanged from the input sentence to the output were predicted randomly.
We trained the model for 500 epochs, and the loss on a single, quickly drawn batch of input did not reduce to zero, so this model does not overfit. However, the predictions are not satisfactory, mainly due to the smaller number of inflections in the input dataset, owing to its reduced sentence length.
We trained the model, and the loss reduced to zero for the entire batch within 500 epochs. Therefore this is the best model under our current environment settings. We limited the maximum length of each sentence to 10 during translation, based on the majority of sentence lengths, but the model seems to faithfully correct all of the tokens presented to it at a given time, with a short turnaround time and optimum CPU utilization.
We have chosen to use both the BLEU score and ROUGE-L score as our performance metrics for all the models. Here is a comparison of the scores across all the models.
Following are the ROUGE-L scores of all three models on the training and validation datasets. The total size of the original dataset is 30k sentences; we use 100 random sentences from each of the training and validation sets to calculate the ROUGE-L scores. The following table summarizes the F1 scores on the training and validation datasets. Based on the F1 scores, we can conclude that Model3 performs the best among the given models.
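ROUGE-L is based on the longest common subsequence (LCS) between the reference and candidate sentences; the F1 variant we report combines LCS-precision and LCS-recall. A minimal pure-Python sketch (real evaluations typically use a library such as `rouge-score`):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)  # LCS-precision / LCS-recall
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Because LCS respects word order without requiring contiguous matches, ROUGE-L rewards a corrected sentence that keeps the unchanged words in their original positions, which is exactly the failure mode we observed in the overfit model.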
The translate functions were exposed as a service using the TensorFlow API and can be deployed on a TensorFlow server.
The following links load demonstrations of the predictions in Google Colaboratory.
10. Conclusions and Future Work
We can conclude that, for the given setup, Model3 performs the best among the models.
I am pursuing the following ideas:
- Generating a larger parallel corpus and dataset of error inflections
- Hyperparameter tuning and handling more advanced grammar concepts like phrases and clauses
Full source code is available on Githublink2
If you have any questions or suggestions on any of the above, please connect with me on LinkedIn.