How to Build a WordPiece Tokenizer For BERT


Building the Tokenizer

When building a new tokenizer, we need a lot of unstructured language data. My go-to for this is the OSCAR corpus — an enormous multi-lingual dataset that (at the time of writing) covers 166 different languages.

However, there are many datasets out there. HuggingFace’s datasets library also provides easy access to most of these. We can see just how many with Python:
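As a sketch: at the time of writing, the `datasets` package exposed a `list_datasets` helper; newer releases delegate this to `huggingface_hub`, which is what we use here.

```python
# Sketch: counting datasets available on the Hugging Face Hub.
# We cap the listing with `limit` to keep the call quick; drop it
# to iterate over every dataset on the Hub.
from huggingface_hub import list_datasets

some_datasets = list(list_datasets(limit=100))
print(len(some_datasets))
```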

A cool 1306 datasets. Many of these are ginormous too — OSCAR itself is split into 166 languages, and many of those ‘portions’ of OSCAR contain terabytes of data.

We can download the OSCAR Italian corpus using HF’s datasets. However, we should be careful, as the full dataset contains 11.3B samples, totaling ~69GB of data. HF allows us to specify that we’d like only a portion of the full dataset using the split parameter.

Inside our split parameter, we have specified that we would like the first 2000000 samples from the train set (most datasets are organized into train, validation, and test sets). Note that this will still download the full train set, which will then be cached locally for future use.

We can avoid downloading and caching the full dataset by adding the streaming=True parameter to load_dataset — in this case split must be set to "train" (without the [:2000000]).

Data Formatting

After downloading our data, we must reformat it into simple plaintext files where a newline separates each sample. Storing every sample in a single file would create one huge text file, so instead we split the samples across many smaller files.
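As a sketch, a small helper (the name and the 10,000-samples-per-file choice are ours) that writes an iterable of text strings into numbered plaintext files, one sample per line. With OSCAR you would pass it something like (sample["text"] for sample in dataset):

```python
from pathlib import Path

def write_plaintext_files(samples, out_dir, samples_per_file=10_000):
    """Write text samples to numbered .txt files, one sample per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    buffer, file_count = [], 0
    for sample in samples:
        # flatten internal newlines so each line stays a single sample
        buffer.append(sample.replace("\n", " "))
        if len(buffer) == samples_per_file:
            (out_dir / f"text_{file_count}.txt").write_text(
                "\n".join(buffer), encoding="utf-8")
            buffer, file_count = [], file_count + 1
    if buffer:  # flush any remaining samples
        (out_dir / f"text_{file_count}.txt").write_text(
            "\n".join(buffer), encoding="utf-8")
```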


Once we have saved all of our simple, newline-separated plaintext files, we move on to training our tokenizer!

We first create a list of all of our plaintext files using pathlib.
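A sketch, assuming the files from the previous step landed in a directory such as ./oscar_it:

```python
from pathlib import Path

def list_text_files(data_dir):
    # a sorted list of plaintext file paths to hand to the trainer
    return [str(p) for p in sorted(Path(data_dir).glob("*.txt"))]
```

Calling list_text_files("./oscar_it") would then give us the paths list the trainer expects.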

And then, we initialize and train the tokenizer.

There are a few important arguments to take note of here. During initialization, we have:

  • clean_text — cleans text by removing control characters and replacing all whitespace with spaces.
  • handle_chinese_chars — whether the tokenizer includes spaces around Chinese characters (if found in the dataset).
  • strip_accents — whether we remove accents; when True this will make é → e, ô → o, etc.
  • lowercase — if True the tokenizer will view capital and lowercase characters as equal; A == a, B == b, etc.

And during training, we use:

  • vocab_size — the number of tokens in our tokenizer. During later tokenization of text, unknown words will be assigned an [UNK] token, which is not ideal. We should try to minimize this where possible.
  • min_frequency — minimum frequency for a pair of tokens to be merged.
  • special_tokens — a list of the special tokens that BERT uses.
  • limit_alphabet — maximum number of different characters.
  • wordpieces_prefix — the prefix added to pieces of words (like ##board in our earlier examples).
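Putting the initialization and training arguments above together, a sketch using the tokenizers library’s BertWordPieceTokenizer (the wrapper function and the 30,000 vocab size are our choices):

```python
from tokenizers import BertWordPieceTokenizer

def train_wordpiece(paths, vocab_size=30_000):
    # initialization arguments discussed above
    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        handle_chinese_chars=False,
        strip_accents=False,
        lowercase=False,
    )
    # training arguments discussed above
    tokenizer.train(
        files=paths,
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        limit_alphabet=1000,
        wordpieces_prefix="##",
    )
    return tokenizer
```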

After we’re done with training, all that is left is saving our shiny new tokenizer. We do this with the save_model method, specifying a directory to save to and our tokenizer name:
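A sketch, where the ./bert-it directory name is our choice; called with no prefix argument, save_model writes a plain vocab.txt into the directory:

```python
import os

def save_tokenizer(tokenizer, out_dir="./bert-it"):
    os.makedirs(out_dir, exist_ok=True)
    # writes out_dir/vocab.txt; passing a prefix as a second argument
    # would produce <prefix>-vocab.txt instead
    return tokenizer.save_model(out_dir)
```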

And with that, we have built and saved our BERT tokenizer. In our tokenizer directory, we should find a single file, vocab.txt.

Screenshot of the vocab.txt file — our new tokenizer text to token ID mappings.

During tokenization, vocab.txt is used to map text to tokens, which are then mapped to token IDs based on each token’s row number in vocab.txt; those IDs are then fed into BERT!
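A sketch of using the finished tokenizer (the vocab path follows the hypothetical save directory from before). Loading the saved vocab.txt back recreates the tokenizer, and encode exposes both the wordpiece strings and their IDs:

```python
from tokenizers import BertWordPieceTokenizer

def tokenize_with_vocab(vocab_file, text):
    # rebuild the tokenizer from its saved vocab.txt; in practice you
    # should also pass the same clean_text/strip_accents/lowercase
    # flags used at training time
    tokenizer = BertWordPieceTokenizer(vocab_file)
    encoding = tokenizer.encode(text)
    return encoding.tokens, encoding.ids  # wordpieces and their row numbers
```

Calling tokenize_with_vocab("./bert-it/vocab.txt", "ciao, come va?") would return the wordpiece tokens alongside their vocab.txt row numbers.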

A small section of vocab.txt showing tokens and their token IDs (e.g., row numbers).

