Building the Tokenizer
When building a new tokenizer, we need a lot of unstructured language data. My go-to for this is the OSCAR corpus — an enormous multi-lingual dataset that (at the time of writing) covers 166 different languages.
However, there are many datasets out there. HuggingFace’s
datasets library also provides easy access to most of these. We can see just how many with Python:
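A quick sketch of that check, assuming a `datasets` version that still exposes `list_datasets` (newer releases move this to `huggingface_hub`):

```python
# Assumes an older `datasets` release; newer versions expose this
# via huggingface_hub.list_datasets instead.
from datasets import list_datasets

all_ds = list_datasets()  # queries the HF hub, so it needs internet access
print(len(all_ds))  # the count grows over time; ~1.3K at the time of writing
```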
A cool 1306 datasets. Many of these are ginormous too — OSCAR itself is split into 166 languages, and many of those ‘portions’ of OSCAR contain terabytes of data.
We can download the OSCAR Italian corpus using HF’s
datasets. However, we should be careful, as the full dataset contains 11.3B samples, a total of ~69GB of data. HF allows us to request just a portion of the full dataset using the split parameter. Here we have specified that we would like the first 2000000 samples of the train set (most datasets are organized into train, validation, and test sets). Note that this still downloads the full train set, which is then cached locally for future use.
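That download might look like the following sketch; the config name unshuffled_deduplicated_it is an assumption for the Italian portion of OSCAR:

```python
from datasets import load_dataset

# "unshuffled_deduplicated_it" is an assumed config name for Italian OSCAR.
dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",
    split="train[:2000000]",  # first 2M samples only
)
```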
We can avoid downloading and caching the full dataset by adding the streaming=True parameter to load_dataset. In this case, split must be set to "train" (without the [:2000000] slice).
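The streaming variant, again assuming the Italian config name:

```python
from datasets import load_dataset

# Streaming pulls samples on the fly as we iterate, so the full
# ~69GB dataset is never downloaded or cached locally.
dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_it",  # assumed config name for Italian
    split="train",
    streaming=True,
)
```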
After downloading our data, we must reformat it into simple plaintext files, where a newline separates each sample. Storing every sample in a single file would create one huge text file, so instead we split the samples across many smaller files.
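A minimal sketch of that chunking step; the function and file names here are my own, not from the original code:

```python
from pathlib import Path

def save_plaintext_chunks(samples, out_dir, samples_per_file=10_000):
    """Write samples to numbered .txt files, one sample per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, file_count = [], 0
    for text in samples:
        # replace internal newlines so one line == one sample
        chunk.append(text.replace("\n", " "))
        if len(chunk) == samples_per_file:
            (out_dir / f"text_{file_count}.txt").write_text(
                "\n".join(chunk), encoding="utf-8")
            chunk, file_count = [], file_count + 1
    if chunk:  # flush the final, partially filled chunk
        (out_dir / f"text_{file_count}.txt").write_text(
            "\n".join(chunk), encoding="utf-8")
```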
Once we have saved all of our simple, newline-separated plaintext files, we move on to training our tokenizer!
We first create a list of the paths to all of our plaintext files.
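Using pathlib, this might look like the following; the directory name oscar_it is an assumption:

```python
from pathlib import Path

# "oscar_it" is an assumed directory holding our plaintext files
paths = [str(p) for p in Path("oscar_it").glob("*.txt")]
```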
And then, we initialize and train the tokenizer.
There are a few important arguments to take note of here. During initialization we have:
clean_text — cleans text by removing control characters and replacing all whitespace with spaces.
handle_chinese_chars — whether the tokenizer includes spaces around Chinese characters (if found in the dataset).
strip_accents — whether we remove accents; when True, this turns é → e, ô → o, etc.
lowercase — whether the tokenizer views capital and lowercase characters as equal; A == a, B == b, etc.
And during training, we use:
vocab_size — the number of tokens in our tokenizer. During later tokenization, unknown words are assigned the [UNK] token, which is not ideal, so we should try to minimize this where possible.
min_frequency — the minimum frequency for a pair of tokens to be merged.
special_tokens — a list of the special tokens that BERT uses.
limit_alphabet — the maximum number of different characters.
wordpieces_prefix — the prefix added to pieces of words (like ##board in our earlier examples).
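Putting initialization and training together, a sketch using the tokenizers library might look like this; the argument values are illustrative choices, not necessarily those of the original code:

```python
from tokenizers import BertWordPieceTokenizer

def train_bert_tokenizer(paths, vocab_size=30_522):
    """Initialize and train a WordPiece tokenizer on plaintext files."""
    tokenizer = BertWordPieceTokenizer(
        clean_text=True,            # strip control chars, normalize whitespace
        handle_chinese_chars=False,
        strip_accents=False,        # keep accents (important for Italian)
        lowercase=False,
    )
    tokenizer.train(
        files=[str(p) for p in paths],
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        limit_alphabet=1000,
        wordpieces_prefix="##",
    )
    return tokenizer

# e.g. tokenizer = train_bert_tokenizer(paths)
```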
After we’re done with training, all that is left is saving our shiny new tokenizer. We do this with the
save_model method — specifying a directory to save our tokenizer and our tokenizer name:
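A minimal end-to-end sketch of the save step, using a toy corpus and assumed names throughout; note that when a name is passed, save_model prefixes the vocabulary file with it:

```python
import os
import tempfile
from tokenizers import BertWordPieceTokenizer

# Toy corpus, just to have a trained tokenizer to save.
work = tempfile.mkdtemp()
corpus = os.path.join(work, "sample.txt")
with open(corpus, "w", encoding="utf-8") as f:
    f.write("una frase di esempio\n" * 50)

tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=[corpus], vocab_size=200)

out_dir = os.path.join(work, "bert-it")  # assumed directory/name
os.makedirs(out_dir, exist_ok=True)
# save_model returns the list of files it wrote (the vocabulary file)
saved_files = tokenizer.save_model(out_dir, "bert-it")
```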
And with that, we have built and saved our BERT tokenizer. In our tokenizer directory we should find a single file, vocab.txt. It is used to map text to tokens, and each token is mapped to a token ID given by its row number in vocab.txt; those IDs are then fed into BERT!
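The row-number mapping can be sketched in a few lines of plain Python; the helper name here is my own:

```python
def load_vocab(path):
    """Map each token in vocab.txt to its ID (the token's row number)."""
    with open(path, encoding="utf-8") as f:
        return {token.rstrip("\n"): idx for idx, token in enumerate(f)}
```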