Parallel Sentence Alignment in Python



Utilizing a machine translation-based sentence-alignment tool

By reading this piece, you will learn to align parallel sentences from two monolingual files whose sentences are in order but not aligned line by line. Let’s say that you have the following English sentences (taken from an English translation of The Metamorphosis by Franz Kafka):

One morning, as Gregor Samsa was waking up from anxious dreams, he discovered that in bed he had been changed into a monstrous verminous bug.
He lay on his armour-hard back and saw, as he lifted his head up a little, his brown, arched abdomen divided up into rigid bow-like sections.
From this height the blanket, just about ready to slide off completely, could hardly stay in place.
His numerous legs, pitifully thin in comparison to the rest of his circumference, flickered helplessly before his eyes.
"What's happened to me," he thought.
It was no dream.

and the corresponding Italian sentences as follows:

Gregorio Samsa, svegliandosi un mattino da sogni agitati, si trovò trasformato, nel suo letto, in un enorme insetto immondo.
Giaceva sulla schiena, dura come una corazza e, sollevando un po' la testa, vide un addome arcuato, scuro, attraversato da numerose nervature.
La coperta, in equilibrio sulla sua punta, minacciava di cadere da un momento all'altro; mentre le numerose zampe, pietosamente sottili rispetto alla sua mole, gli ondeggiavano confusamente davanti agli occhi.
"Che mi è successo?" pensò.
Non era un sogno.
La sua camera, una vera camera per esseri umani, anche se un po' piccola, stava ben ferma e tranquilla tra le sue quattro note pareti.

Notice that the English and Italian sentences are not aligned line by line, because the Italian translation uses a semicolon (;) to combine two sentences (the blanket and the legs) into one.

As a result, this dataset is not usable for training a machine translation model, since the translated sentences are offset by a line from that point on. You could fix the offset manually, but that quickly becomes overwhelming for large corpora containing millions of sentences.
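To make the offset concrete, here is a minimal sketch that pairs the last four lines of each excerpt by position, the way a naive loader would. The variable names are illustrative only.

```python
# Sentences 3-6 of each excerpt, one per line, as they would sit in the two files.
english = [
    "From this height the blanket, just about ready to slide off completely, could hardly stay in place.",
    "His numerous legs, pitifully thin in comparison to the rest of his circumference, flickered helplessly before his eyes.",
    "\"What's happened to me,\" he thought.",
    "It was no dream.",
]
italian = [
    "La coperta, in equilibrio sulla sua punta, minacciava di cadere da un momento all'altro; mentre le numerose zampe, pietosamente sottili rispetto alla sua mole, gli ondeggiavano confusamente davanti agli occhi.",
    "\"Che mi è successo?\" pensò.",
    "Non era un sogno.",
    "La sua camera, una vera camera per esseri umani, anche se un po' piccola, stava ben ferma e tranquilla tra le sue quattro note pareti.",
]

# Naive pairing: line i of one file with line i of the other.
naive_pairs = list(zip(english, italian))

# Print the start of each pair to inspect the drift by eye.
for en, it in naive_pairs:
    print(en[:35], "|", it[:35])
```

From the second pair onward every English line is matched with the wrong Italian line: "It was no dream." ends up next to the sentence about the room.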

Let’s explore how you can create a subset of the original dataset with aligned sentences via a Python package called Bleualign.

Bleualign

This Python package aligns source text with its translated target text at the sentence level. It requires the source text to be machine-translated beforehand, so that the automatic translation can be compared against the target text. The sentences in both files must also be in their original order, not shuffled.

Setup

It is highly recommended to set up a virtual environment before you continue. Activate it and run the following command to install the package via pip:

pip install git+https://github.com/rsennrich/Bleualign.git

The next step is to copy the scripts used in this article, bleualign.py and batch_align.py, from the repository into your working directory.

Concept

Before running anything, let’s explore the underlying process behind a single run of sentence alignment. It works as follows:

  1. Translate the source text using any machine translation service (Google Translate, etc.). The translated text must correspond to the source text line by line. There should not be any extra line breaks or empty lines in either file.
  2. Run the script to calculate the similarity (a modified BLEU score) between the target text and the auto-translated source text.
  3. The script aligns the sentences based on the similarity scores.
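The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Bleualign's actual implementation: `overlap_score` is a hypothetical stand-in for the modified BLEU score (an average of clipped unigram and bigram precision), `align` is a hypothetical helper that finds the best in-order chain of one-to-one pairs, and the sentences and the 0.3 threshold are toy values.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(candidate, reference, max_n=2):
    """Average clipped n-gram precision; a simplified stand-in for modified BLEU."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        if total == 0:
            continue  # candidate too short to contain any n-gram of this order
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions) if precisions else 0.0


def align(translated_source, target, threshold=0.3):
    """Return in-order (source_index, target_index) pairs with the highest total score."""
    n, m = len(translated_source), len(target)
    scores = [[overlap_score(s, t) for t in target] for s in translated_source]
    # best[i][j]: best total score using the first i source and first j target lines.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = scores[i - 1][j - 1]
            paired = best[i - 1][j - 1] + s if s >= threshold else float("-inf")
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    # Walk back through the table to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = scores[i - 1][j - 1]
        if s >= threshold and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy data: the machine-translated source and the target are in the same language.
translated_source = ["the cat sat on the mat", "it was raining", "she opened the door"]
target = ["the cat sat on a mat", "an unrelated extra line", "she opened the old door"]
print(align(translated_source, target))
```

Lines that have no sufficiently similar counterpart are simply left out of the result, which is why the aligned output is a subset of the original data.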

Bleualign can also use translations in both directions (source to target and target to source). In that case, it keeps only the intersection of the two alignment results as the final output, which yields higher-quality alignments.
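The intersection idea can be illustrated with plain sets. In this sketch, `intersect_alignments` is a hypothetical helper and the index pairs are made-up toy values; it assumes each direction has already produced a list of line-index pairs.

```python
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only the pairs that both alignment directions agree on.

    src_to_tgt holds (source_index, target_index) pairs;
    tgt_to_src holds (target_index, source_index) pairs.
    """
    flipped = {(source, target) for target, source in tgt_to_src}
    return sorted(set(src_to_tgt) & flipped)


forward = [(0, 0), (1, 1), (2, 3)]   # from the translated source vs. the target
backward = [(0, 0), (1, 1), (2, 2)]  # from the translated target vs. the source
print(intersect_alignments(forward, backward))
```

A pair that only one direction proposes, such as (2, 3) here, is dropped, trading some coverage for precision.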

Dataset

For experimental purposes, I extracted the text from a novel and split it into sentences via a regular expression. The final dataset consists of the following number of lines:

  • English source sentences: 6,149 lines
  • Italian target sentences: 5,579 lines
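The exact regular expression used for this dataset is not given here, so the following is only a minimal sketch of one possible approach: split after sentence-ending punctuation that is followed by whitespace. The function name `split_sentences` and the pattern are assumptions for illustration.

```python
import re


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    normalized = " ".join(text.split())  # collapse line breaks and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]


text = (
    "His numerous legs, pitifully thin in comparison to the rest of his "
    "circumference, flickered helplessly before his eyes. "
    "\"What's happened to me,\" he thought.\nIt was no dream."
)
for sentence in split_sentences(text):
    print(sentence)
```

A pattern this simple will also split after abbreviations such as "Mr.", so real corpora usually need extra rules or a dedicated sentence splitter.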

After that, I translated the content via Google Translate (each translated file must have the same number of lines as the file it was translated from):

  • Translated source sentences: 6,149 lines
  • Translated target sentences: 5,579 lines
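Since the line counts must match, a quick check before alignment can save a failed run. The helper below is a hypothetical sketch (it is not part of Bleualign) that compares a text file with its translation and reports empty lines.

```python
from pathlib import Path


def check_line_parity(original_path, translation_path):
    """Return a list of problems; an empty list means the two files line up."""
    original = Path(original_path).read_text(encoding="utf-8").splitlines()
    translation = Path(translation_path).read_text(encoding="utf-8").splitlines()
    problems = []
    if len(original) != len(translation):
        problems.append(
            f"line count mismatch: {len(original)} vs {len(translation)}"
        )
    for name, lines in (("original", original), ("translation", translation)):
        empty = [number for number, line in enumerate(lines, start=1) if not line.strip()]
        if empty:
            problems.append(f"{name} has empty lines at {empty}")
    return problems


# Example call, assuming the file names used in this article:
# print(check_line_parity("sourcetext.txt", "sourcetranslation.txt"))
```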

Sentence Alignment

Assuming that the dataset (source, target, translated source) is located in the same directory as the bleualign.py script, run the following command to start the sentence alignment process:

python bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt -o outputfile

It accepts the following arguments:

  • -s: source file path
  • -t: target file path
  • --srctotarget: translated source file path
  • --targettosrc: translated target file path (optional)
  • -o: output file prefix

For better accuracy, you can also translate the target text into the source language and execute the following command instead:

python bleualign.py -s sourcetext.txt -t targettext.txt --srctotarget sourcetranslation.txt --targettosrc targettranslation.txt -o outputfile

You should see the following output on your terminal:

reading in article 0:
processing
computing alignment between srctotarget (file 0) and target text
Evaluating sentences with bleu
finished
searching for longest path of good alignments
finished
filling gaps
finished
computing alignment between targettosrc (file 0) and source text
Evaluating sentences with bleu
finished
searching for longest path of good alignments
finished
filling gaps
finished
intersecting both directions
finished with article
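If you would rather launch the script from Python than from a shell, the call can be wrapped with the standard subprocess module. This is a sketch only: `build_bleualign_command` and `run_bleualign` are hypothetical helpers, and they use nothing beyond the flags listed above and the assumption that bleualign.py sits in the current directory.

```python
import subprocess
import sys


def build_bleualign_command(source, target, src_to_target, output_prefix,
                            target_to_src=None, script="bleualign.py"):
    """Assemble the argument list for the bleualign.py call shown above."""
    command = [sys.executable, script,
               "-s", source,
               "-t", target,
               "--srctotarget", src_to_target]
    if target_to_src is not None:
        # Optional second direction (translated target text).
        command += ["--targettosrc", target_to_src]
    command += ["-o", output_prefix]
    return command


def run_bleualign(*args, **kwargs):
    """Run the script; raises CalledProcessError if it exits with an error."""
    return subprocess.run(build_bleualign_command(*args, **kwargs), check=True)


command = build_bleualign_command(
    "sourcetext.txt", "targettext.txt", "sourcetranslation.txt", "outputfile",
    target_to_src="targettranslation.txt",
)
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.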

Result

The script also generates two files based on the output prefix that you have set:

  • outputfile-s: output source text after sentence alignment
  • outputfile-t: output target text after sentence alignment

The result is not perfect, but it is decent enough as an initial parallel dataset. In this case, the script aligned 4,442 lines of parallel sentences in just a few seconds (not counting the time taken to translate the source and target sentences). Imagine how long it would take to align them manually. All that remains is to verify the output and make some final touch-ups to ensure a quality parallel dataset for your machine translation model.
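For the manual verification step, it can help to put the two output files side by side. The sketch below assumes the output files are named with the prefix followed by -s and -t, as listed above; `load_aligned_pairs` and `write_tsv` are hypothetical helpers.

```python
import csv
from pathlib import Path


def load_aligned_pairs(output_prefix):
    """Read <prefix>-s and <prefix>-t and pair them line by line."""
    source_lines = Path(f"{output_prefix}-s").read_text(encoding="utf-8").splitlines()
    target_lines = Path(f"{output_prefix}-t").read_text(encoding="utf-8").splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"expected equal line counts, got {len(source_lines)} and {len(target_lines)}"
        )
    return list(zip(source_lines, target_lines))


def write_tsv(pairs, tsv_path):
    """Write the pairs as a two-column, tab-separated file for review."""
    with open(tsv_path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(pairs)


# Example call, assuming the prefix used in this article:
# write_tsv(load_aligned_pairs("outputfile"), "aligned_pairs.tsv")
```

The resulting file opens in any spreadsheet tool, which makes it easier to scan for pairs that need fixing.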

The algorithm does not work well when short sentences are not segmented properly. For example, the following issue shows up:

# English
Yes.
Go ahead. I'll wait for you.
# Italian
Sí. Allora vai.
Ti aspetto.
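One cheap way to surface pairs like these during review is a character-length ratio check. This heuristic is an assumption of this sketch, not something Bleualign does for you, and the 0.5 cut-off is an arbitrary illustrative value.

```python
def flag_length_mismatches(pairs, min_ratio=0.5):
    """Return the indices of pairs whose lengths differ by more than the ratio allows."""
    flagged = []
    for index, (source, target) in enumerate(pairs):
        shorter, longer = sorted((len(source), len(target)))
        ratio = shorter / longer if longer else 1.0
        if ratio < min_ratio:
            flagged.append(index)
    return flagged


pairs = [
    ("Yes.", "Sí. Allora vai."),
    ("Go ahead. I'll wait for you.", "Ti aspetto."),
    ("It was no dream.", "Non era un sogno."),
]
print(flag_length_mismatches(pairs))
```

Both problematic pairs from the example are flagged, while the correctly aligned third pair passes. Languages with very different sentence lengths would need a different threshold.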

Batch Mode

There is also a script that processes files in batches. Assume that you have the following:

  • raw_files: a folder that contains the whole dataset
  • 0.en: source file
  • 0.it: target file
  • 0.trans: translated source file
  • 1.en: source file
  • 1.it: target file
  • 1.trans: translated source file

Simply run the following command:

# syntax
python batch_align.py directory source_suffix target_suffix translation_suffix
# example
python batch_align.py raw_files en it trans

It will generate the following files as output:

  • 0.en.aligned
  • 0.it.aligned
  • 1.en.aligned
  • 1.it.aligned
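Before starting a batch run, you may want to confirm that every file set is complete. The helper below is a hypothetical sketch that only inspects the folder layout described above (a shared name plus the three suffixes); it does not call batch_align.py.

```python
from pathlib import Path


def find_complete_sets(directory, source_suffix, target_suffix, translation_suffix):
    """Return the base names that have a source, target and translation file."""
    folder = Path(directory)
    suffixes = (source_suffix, target_suffix, translation_suffix)
    # Start from the source files, then require the other two to exist as well.
    stems = {path.stem for path in folder.glob(f"*.{source_suffix}")}
    return sorted(
        stem for stem in stems
        if all((folder / f"{stem}.{suffix}").is_file() for suffix in suffixes)
    )


# Example call, assuming the folder layout from this article:
# print(find_complete_sets("raw_files", "en", "it", "trans"))
```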

Notes

One major disadvantage is that you need to obtain a translated version of your source sentences before you can perform sentence alignment. Apart from doing auto-translation with an external API such as Google Translate, you can always train a simple machine translation model on an open source dataset to produce the initial translation.

For more information on the algorithm, kindly read the following papers:

Rico Sennrich, Martin Volk (2010): MT-based Sentence Alignment for OCR-generated Parallel Texts. In: Proceedings of AMTA 2010, Denver, Colorado.

Rico Sennrich, Martin Volk (2011): Iterative, MT-based Sentence Alignment of Parallel Texts. In: NODALIDA 2011, Nordic Conference of Computational Linguistics, Riga.

Conclusion

Let’s recap what you have learned today.

This article started off with a simple problem statement highlighting the issue faced when building parallel corpora for machine translation models.

It moved on to the installation and data preparation steps, which require the source sentences to be translated beforehand.

Then, it explored the fundamental concept behind Bleualign, which computes the similarity (modified BLEU) between the target sentences and the translated source sentences.

Finally, it covered an example command and the results of sentence alignment. Bleualign also supports processing files in batches via one of its scripts.

Thanks for reading this piece. Have a great day ahead!

References

  1. GitHub — Bleualign
  2. Franz Kafka. (1915) The Metamorphosis
