ByT5: Towards a token-free future with pre-trained byte-to-byte models


NLP Research Paper Summary


In this blog, I summarize the paper ByT5: Towards a token-free future with pre-trained byte-to-byte models as I understood it. Please feel free to share your thoughts in the comments!


Most NLP research to date has relied on tokenizers to split a text sequence into smaller lexical units. Today, subword tokenization is the de-facto technique for representing text units (a role played by unigrams, bigrams, etc. at some point in the past).

These approaches have some notable limitations:

  • They are not robust to out-of-vocabulary (OOV) words.
  • Variations in spelling, capitalization, etc. result in different representations.

The authors propose token-free models that operate directly on raw text (bytes), giving us the below-mentioned benefits:

  • They can process text in any language; no language-specific tokenizer is required. [One tokenizer is all you need!]
  • They are robust to noise and remove the need for complex text-preprocessing pipelines.
  • They don't need a huge vocabulary matrix, since a byte-level model, by definition, requires only 256 embeddings.
Comparison of pre-training example creation and network architecture between mT5 (Xue et al., 2020) and ByT5 (this work) | Image from Source
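The "tokenizer-free" idea above can be sketched in a few lines: UTF-8 encoding maps any text, in any language, to integers in the range 0-255, so the embedding table never needs more than 256 entries (plus a handful of special IDs).

```python
def byte_ids(text: str) -> list[int]:
    """Byte-level 'tokenization': UTF-8 encode and treat each byte as an ID."""
    return list(text.encode("utf-8"))

ids = byte_ids("héllo")
assert all(0 <= b < 256 for b in ids)   # the whole vocabulary is just 256 IDs
assert len(byte_ids("hello")) == 5      # ASCII: one byte per character
assert len(byte_ids("héllo")) == 6      # 'é' takes two UTF-8 bytes
```

Note how non-ASCII characters expand to multiple bytes, which is also the root of the longer-sequence drawback discussed next.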

One of the main drawbacks of a byte-level model, however, is that byte sequences are usually longer than the corresponding token sequences, resulting in higher processing cost. Self-attention in transformers is a quadratic computation, which poses a serious challenge when processing longer and longer sequences. That said, we do have advancements like Longformer that use sparse attention and other clever techniques to handle very long sequences.

mT5 vs ByT5 — Design

  1. mT5/T5 uses subword tokens, whereas ByT5 feeds raw bytes to the model, making it agnostic to the choice of text preprocessing.
  2. mT5/T5 uses span masking as a self-supervised objective for pre-training on large amounts of unlabelled data. ByT5 uses the same idea but masks out spans of bytes. While mT5 masks 3 subword tokens on average, the authors found longer mask sequences to benefit ByT5 and set the average mask span length to 20 bytes.
  3. mT5/T5 uses what is known as a "balanced architecture" (depth of encoder == depth of decoder), whereas the authors of ByT5 found it works best when the encoder is about 3x deeper than the decoder, making the architecture encoder-heavy. Even after decreasing the decoder's capacity, they found the model performed better on both classification and generation (translation/summarization) tasks.
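The byte-level span masking in point 2 can be sketched as follows. Note that mask_byte_spans is a hypothetical helper and the span-sampling details here are a simplification for illustration, not the paper's exact corruption algorithm; only the ~20-byte average span length comes from the paper.

```python
import random

def mask_byte_spans(byte_ids, mask_rate=0.15, mean_span_len=20, seed=0):
    """Toy span corruption: replace random spans of bytes (average length
    ~20, as in ByT5) with fresh sentinel IDs just above the 0-255 range."""
    rng = random.Random(seed)
    ids = list(byte_ids)
    budget = int(len(ids) * mask_rate)  # total bytes we aim to mask
    sentinel = 256                      # first ID past the byte vocabulary
    out, targets = [], []
    i, masked = 0, 0
    while i < len(ids):
        # Start a span with probability chosen so ~mask_rate of bytes get masked.
        if masked < budget and rng.random() < mask_rate / mean_span_len:
            span = min(mean_span_len, len(ids) - i)
            targets.append((sentinel, ids[i:i + span]))  # span becomes a target
            out.append(sentinel)                         # sentinel marks the gap
            sentinel += 1
            masked += span
            i += span
        else:
            out.append(ids[i])
            i += 1
    return out, targets

text = "Token-free models operate directly on raw UTF-8 bytes."
corrupted, spans = mask_byte_spans(list(text.encode("utf-8")))
```

The model is then trained to reconstruct the masked spans from the sentinels, exactly as in T5-style span corruption, just over bytes instead of subwords.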

Also, as a quality-control protocol, since not all byte sequences are valid UTF-8, the authors remove any invalid bytes using Python's byte decode function: bytes.decode("utf-8", errors="ignore")
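In practice this filtering is a one-liner: decoding with errors="ignore" silently drops any bytes that do not form valid UTF-8.

```python
# \xff and \xfe can never appear in valid UTF-8, so they are dropped.
raw = b"valid text \xff\xfe more text"
clean = raw.decode("utf-8", errors="ignore")
assert clean == "valid text  more text"
```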

Performance Analysis

  1. Typically, the vector representations of the vocabulary tokens take up a large share of a model's total parameters. For example, the vocabulary and softmax output matrices in the recent mT5-Base model take about 66% of the total parameter count. With a byte-level model that is no longer the case, so to match the larger model's parameter count we can make the model deeper and wider, giving us an edge by having a more sophisticated model.
  2. Byte sequences are typically longer for a given piece of text compared to a word or subword tokenization scheme. This results in significantly higher computational cost, because transformers use self-attention, which has quadratic time complexity in sequence length.
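A back-of-envelope calculation makes point 1 concrete. The figures below are approximations I'm assuming for illustration (mT5 uses a SentencePiece vocabulary of roughly 250k entries; d_model for the Base size is 768):

```python
# Embedding-table size = vocabulary size x d_model (approximate figures).
vocab_mt5, vocab_byt5, d_model = 250_000, 256, 768

emb_mt5 = vocab_mt5 * d_model    # ~192M parameters just for embeddings
emb_byt5 = vocab_byt5 * d_model  # ~0.2M parameters

# The byte model's embedding table is nearly 1000x smaller, so that
# parameter budget can be reinvested in deeper/wider transformer layers.
assert emb_mt5 // emb_byt5 > 900
```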


Abstractive Text Summarization (English Language)

They evaluate mT5 and ByT5 on the XSum dataset for abstractive text summarization. As you can see in the table below, ByT5 outperforms mT5 for all size variants and comes close to the Pegasus model (17.0), which was trained specifically for abstractive summarization.

Performance of mT5 and ByT5 across different model sizes on GEM-XSUM | Image from Source

Text Classification (English Language)

They evaluate the performance of mT5 and ByT5 across different model sizes on the GLUE and SuperGLUE tasks. As we can see in the table below, ByT5 outperforms mT5 only for the Small and Base model sizes. The authors attribute this to more efficient parameter usage, since most of mT5's parameters are locked up in the vocabulary matrix.

Performance of mT5 and ByT5 across different model sizes on GLUE and SuperGLUE | Image from Source

Also, as you can see in the table below, under the fixed-parameter-count setting, d_model and d_ff of the two models become comparable as model size increases, unlike at the smaller sizes. This is a likely explanation for the behaviour exhibited in the table above.

Comparison of mT5 and ByT5 architectures | Image from Source

So yeah, that’s it for this blog. There are some more experiments that are mentioned in the paper. I encourage you to read them as well.

If you wish, you can also check out other research paper summaries that I have written.

Feel free to read the entire paper and say “Hi” to the authors and appreciate their contribution.

Paper Title: ByT5: Towards a token-free future with pre-trained byte-to-byte models

Paper Link:

Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Thank You!


