Machine Translation Evaluation with Cometinho
Practical advice to reduce your model size and save computation time and money, with an eye on performance
The European Association for Machine Translation (EAMT) conference is a venue where MT researchers, users and translators gather to discuss the latest advances in the industry. It is a great place to see what is happening across the European continent in terms of MT development and adoption. In this article, I want to share some ideas from this year's Best Paper Award winner, “Searching for COMETINHO: The Little Metric That Could”, from the research lab of Unbabel, a company based in Lisbon, Portugal, that offers translation services combining MT and human translators. You can find the online version of the paper in the ACL Anthology.
The goal of the paper is to show how, in a real scenario, several techniques from the literature can be combined to speed up the execution of a very large model and save computation, while keeping the loss in quality small. The paper focuses on reducing Comet, a model for automatic machine translation (MT) evaluation, but many of the points it describes are general enough to interest anybody who deals with large, slow models and needs to save computation for reasons of money or time.
What is Comet?
Comet, originally described in this paper, is a model for machine translation evaluation based on XLM-R. The large language model is used to separately encode the source, the hypothesis and the reference sentences, reducing each of them to a single vector. These vectors are then combined to output a single real number that represents a score for the hypothesis quality.
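The triplet-encoding structure can be sketched in a few lines. This is a toy illustration only: `encode` here is a hypothetical bag-of-token-lengths pooler standing in for XLM-R, and the "regression head" is replaced by a simple distance heuristic, but the data flow (three independent sentence vectors combined into one real-valued score) mirrors the description above.

```python
def encode(sentence, dim=4):
    """Toy stand-in for the XLM-R encoder: pool a sentence to one vector.
    Each token contributes its character length to one vector dimension."""
    vec = [0.0] * dim
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        vec[i % dim] += len(tok)
    n = len(tokens) or 1
    return [v / n for v in vec]

def comet_like_score(src, hyp, ref):
    """Comet's high-level shape: encode source, hypothesis and reference
    separately, then combine the three vectors into a single quality score.
    Here the 'combination' is just a similarity heuristic: hypotheses whose
    vector is close to the reference (and, to a lesser extent, the source)
    score higher."""
    s, h, r = encode(src), encode(hyp), encode(ref)
    return -sum((h[d] - r[d]) ** 2 + 0.5 * (h[d] - s[d]) ** 2
                for d in range(len(h)))
```

The key architectural point for what follows is that the three sentences never interact inside the encoder, which is exactly what makes per-sentence caching possible later on.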
Comet has ranked high at the WMT metrics shared task for correlation with human judgement. However, it is quite a large model, with about 500M parameters, and it is slow to run. This is particularly a problem when it must be run often, for instance during minimum Bayes risk decoding or training.
The first pair of speed-ups obtained in the Cometinho paper comes from pure code modifications: length sorting and caching.
Deep learning models are run in mini-batches to make better use of GPU parallelization capabilities. With texts and other variable-length inputs, a mini-batch takes the length of the longest sequence in it. The other sequences are padded with zeroes, and all the computation performed for the padded positions is essentially wasted. An effective and well-known way to reduce this waste is to sort the sentences by length before starting the decoding process. Length sorting always yields speed improvements, and it is fast and easy to implement. It should therefore always be used when performing inference in batch mode, and all the major deep learning toolkits implement it.
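A minimal sketch of the idea, assuming whitespace tokenization and hypothetical helper names: sort indices by token count, build batches over the sorted order, and keep the permutation so the scores can be mapped back to the input order afterwards.

```python
def batch_by_length(sentences, batch_size):
    """Sort sentence indices by token count so that each mini-batch pads
    to a similar length, then slice the sorted order into batches."""
    order = sorted(range(len(sentences)),
                   key=lambda i: len(sentences[i].split()))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def restore_order(flat_order, scores_sorted):
    """Map scores computed in sorted order back to the original input order."""
    out = [None] * len(flat_order)
    for pos, idx in enumerate(flat_order):
        out[idx] = scores_sorted[pos]
    return out
```

Because same-length sentences end up in the same batch, almost no position is padding, so almost no computation is wasted.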
Caching is made possible by the way Comet encodes its triplets: source, hypothesis and reference are encoded into three separate vectors. This makes it possible to store the vector that represents a sentence and simply retrieve it when the sentence occurs again. How often does that happen, you may ask? Assume you are using Comet to compare two different MT models. The two models are run on the exact same test set, so every source and reference sentence occurs at least twice. Another example is minimum Bayes risk decoding, where many hypotheses are scored against the same source sentence, and again the saving is huge. The lesson: if you are designing a deep learning model for a task where speed matters, try to design it so that it can leverage caching, when possible.
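The cache itself can be as simple as a dictionary keyed by the sentence text. The sketch below wraps a hypothetical `encode_fn` and counts hits, so you can measure how much recomputation is avoided when the same test set is scored for several systems.

```python
class EmbeddingCache:
    """Cache one vector per unique sentence; reuse it whenever the same
    source, reference or hypothesis text appears again."""
    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # the expensive encoder call
        self.cache = {}
        self.hits = 0

    def __call__(self, sentence):
        if sentence in self.cache:
            self.hits += 1           # encoder call avoided
        else:
            self.cache[sentence] = self.encode_fn(sentence)
        return self.cache[sentence]
```

Scoring a second system on the same test set then costs only the encoding of its new hypotheses; every source and reference vector is a cache hit.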
The combination of length sorting and caching brings a speed-up of up to 39% when scoring one system, and 65% when scoring 8 systems: the gain grows with the number of systems because caching becomes increasingly effective as sentences recur.
Then comes the part of actually reducing the model size, and the paper experiments with two methods: the first physically removes entire layers from the model, the second reduces the size of some sub-layers.
In the first case, a layer is considered to be a full Transformer layer, including self-attention, feed forward, residual connections and layer normalization.
In the case of this paper, it is easy to remove layers from the top of the encoder, because the pooling layer does not directly use the output of the last encoder layer. Rather, the authors use the same approach proposed to generate embedding vectors with ELMo and take a weighted average of the layers' outputs. The weights, which are all positive and normalized to sum to 1.0, are learned during training and frozen afterwards. A quick analysis of the weights shows that the topmost layers receive the lowest weights, and thus contribute the least*. The authors therefore simply remove them, which causes no harm at inference time since the encoder output is always an average over many layers. The results show a slight performance improvement when removing the 4 topmost layers, and no harm when removing 5 layers. Considering that the model consists of 25 layers, this is a 20% parameter reduction for free (excluding the embedding weights, which still represent the majority of the parameter count but do not affect inference time that much).
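A toy sketch of this ELMo-style pooling, with hypothetical names: each layer output gets one positive weight, normalized to sum to 1.0. Dropping the lowest-weighted top layers then only requires re-normalizing the remaining weights, which is why pruning them is harmless by construction.

```python
def weighted_layer_average(layer_outputs, weights, keep=None):
    """ELMo-style pooling: one learned, positive weight per encoder layer,
    normalized to sum to 1.0. `layer_outputs` is a list of vectors (one per
    layer, bottom first); `keep` optionally drops the topmost layers."""
    if keep is not None:                       # prune the topmost layers
        layer_outputs = layer_outputs[:keep]
        weights = weights[:keep]
    total = sum(weights)
    norm = [w / total for w in weights]        # re-normalize to sum to 1.0
    dim = len(layer_outputs[0])
    return [sum(norm[l] * layer_outputs[l][d] for l in range(len(norm)))
            for d in range(dim)]
```

If the removed layers carried near-zero weights, the re-normalized average is almost unchanged, matching the paper's observation that removing 4–5 top layers costs nothing.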
The second method consists of removing two heads (out of 16) from each self-attention layer. Additionally, the authors reduce the feed-forward size from 4096 to 3072 (75% of the original size). The resulting model, called Prune-Comet, achieves results comparable to Comet with a 37% speed-up in inference time.
Pruning is performed using TextPruner, an open source tool that implements layer pruning for many pre-trained Transformers. Look it up if this is your task.
* The weight proportionally scales the vector's magnitude, but this is only part of the story. This paper by Kobayashi and colleagues shows that the vector's original magnitude must be considered as well. This also makes sense intuitively, since the learned weights can rebalance vectors of different magnitudes.
While layer pruning reduces the parameter count with small degradation, the model is still too large to be used efficiently. Indeed, its size strongly affects computation time and cost, particularly for applications that require repeated MT evaluation, such as minimum Bayes risk decoding. To achieve a significant parameter reduction, knowledge distillation is needed.
With knowledge distillation, a new, smaller model is trained using the original model's outputs as learning targets. For this specific task, the authors use the OPUS corpus for 15 language pairs, translate its 25M sentence pairs with two MT models of different quality available in Hugging Face Transformers, and use an ensemble of 5 Comet models to evaluate the translations. Since these are machine translation training data without human evaluation, they were not used to train Comet.
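The data-building step described above can be sketched as follows. All function names are hypothetical stand-ins: `mt_systems` for the two Hugging Face translation models, `teacher_ensemble` for the 5 Comet models; the ensemble average becomes the regression target the student is trained on.

```python
def build_distillation_data(sources, references, mt_systems, teacher_ensemble):
    """Sketch of the distillation pipeline: translate each source with every
    MT system, score each translation with an ensemble of teacher models,
    and keep the averaged score as the student's regression target."""
    examples = []
    for src, ref in zip(sources, references):
        for translate in mt_systems:
            hyp = translate(src)
            scores = [teacher(src, hyp, ref) for teacher in teacher_ensemble]
            target = sum(scores) / len(scores)   # ensemble average as label
            examples.append((src, hyp, ref, target))
    return examples
```

The student is then trained with an ordinary regression loss (e.g. MSE) against these targets, so no human quality judgements are needed for the 25M sentence pairs.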
The student model keeps the same deep learning architecture as the teacher models, but it is much smaller: 12 layers, 12 attention heads, a feed-forward size of 1576 and an embedding vector size of 384. Its parameter count is 119M instead of the original 582M.
The final results show that, with a model size reduction of about 80%, the distilled model loses 10% in quality and gains 53% in speed. It is true that a new training run is needed, but in real application scenarios training is done once while inference runs thousands of times or more.
The Cometinho paper is of practical interest because it shows how to apply model reduction techniques that can be transferred with ease to different application domains.
What should you take home? First, there are many methods to save computation with small to no performance loss, and they can be reused among different domains. Use them whenever you can.
Second, in a sense, models should be designed to be reduced. In this paper's example, it is the ELMo-like layer average that makes layer pruning possible.
Third, as a side note, the paper would be more complete if it also showed results for a baseline model of the reduced size trained from scratch, rather than obtained with the techniques proposed here. It is always better to show that reducing an existing model beats training a smaller one directly.
If you are interested in the topic, your next step can be to study about quantization as a further technique to reduce the model size and improve the inference time.
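To give a feel for what quantization does, here is a toy sketch of post-training quantization in plain Python: weights are mapped to a limited number of integer levels (256 for 8 bits) and back, trading a small rounding error for storing each weight in fewer bits. Real toolkits (e.g. PyTorch's dynamic quantization) also run the integer arithmetic directly, but the precision/size trade-off is the same idea.

```python
def quantize_dequantize(weights, bits=8):
    """Toy post-training quantization: map float weights onto 2**bits - 1
    uniform integer levels between min and max, then map back, showing the
    rounding error that quantization trades for a smaller model."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in weights]   # what would be stored
    return [lo + qi * scale for qi in q]             # dequantized view
```

With 8 bits, the maximum error per weight is half a quantization step, which for most trained networks is small enough to leave quality nearly untouched while shrinking storage by 4x versus float32.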