In-Depth spaCy Tutorial For Beginners in NLP


Learn the Scikit-learn of Natural Language Processing



No, we won’t be building language models with billions of parameters today. We will start smaller and learn about the basics of NLP with spaCy. We will closely look at how the library works and how you can use it to solve beginner/intermediate NLP problems with ease.

The post is long enough already, so I’ll cut the intro here and jump to the meat of the article.

What is spaCy?

spaCy is like the Sklearn of natural language processing. It is an industry standard with a vast feature set for solving many NLP tasks with state-of-the-art speed and accuracy.

At its core are pipelines, which you can think of as language-specific models already trained on millions of text instances.

It is also at the head of the spaCy ecosystem that includes dozens of libraries and tools such as Prodigy, Forte, displaCy, explacy, ADAM, Coreferee, etc.

spaCy can also shake hands with custom models from TensorFlow, PyTorch, and other frameworks.

Right now, spaCy supports 66 languages as separate pipelines, and new languages are being added slowly.

Basics of spaCy

Before seeing how spaCy works, let’s install it:

pip install -U spacy

spaCy has three pipelines for English, with varying sizes and functionality for complex tasks. In this tutorial, we will only need to install the small and medium pipelines, but I included the large one as well for completeness:
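If you want to follow along, here is how those pipelines are typically downloaded (sizes are approximate):

```shell
# Small pipeline: fast, no word vectors
python -m spacy download en_core_web_sm

# Medium pipeline: includes word vectors (needed for similarity later)
python -m spacy download en_core_web_md

# Large pipeline: bigger vocabulary and vectors (optional)
python -m spacy download en_core_web_lg
```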

After importing spaCy, we need to load one of the pipelines we just installed. For now, we will load the small one and store it to nlp:

It is a convention to name any loaded language models nlp in the spaCy ecosystem. This object can now be called on any text to start information extraction:

# Create the Doc object
doc = nlp(txt)

The doc object is also a convention, and now it is already filled with extra information about the given text’s sentences and words.

In general, the doc object is just an iterator:
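For example, here is a small sketch (my own sample sentence, not the article’s original text):

```python
import spacy

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization
doc = nlp("The tallest living man is from Turkey.")

# Iterating over a Doc yields Token objects, one per token
for token in doc:
    print(token.text)
```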

You can use slicing or indexing notations to extract individual tokens:

>>> token = doc[0]
>>> type(token)
spacy.tokens.token.Token
>>> len(doc)
31

Tokenization is splitting text into words and punctuation. A single token can be a word, a punctuation mark, a number, etc.

If you extract more than one token, then you have a span object:

>>> span = doc[:5]
>>> type(span)
spacy.tokens.span.Span
>>> span.text
'The tallest living man is'

spaCy is also built for memory efficiency. That’s why both token and span objects are just views of the doc object. There is no duplication.

The pre-trained English pipeline, like many other pipelines, has language-specific rules for tokenization and for extracting lexical attributes. Here are six such attributes:

Some interesting attributes are lemma_, which returns the base form of the word stripped of any suffixes, prefixes, tense, or other grammatical markers, and like_num, which recognizes both numeric and spelled-out numbers.

You will be spending most of your time on these four objects — nlp, doc, token and span. Let’s take a closer look at how they are related.

Architecture and core data structures

Let’s start again with the nlp, which is a Language object under the hood:

Language objects are pre-trained on millions of text instances and labels and loaded into spaCy with their binary weights. These weights allow you to perform various tasks on new datasets without worrying about the hairy details.

As I mentioned earlier, spaCy has fully-trained pipelines for 22 languages, some of which you can see below:

For the other 40+ languages, spaCy only offers basic tokenization rules; further functionality is slowly being integrated through community effort.

It is also possible to load the language models directly from the lang sub-module:
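For instance (English here, but any supported language sub-module works the same way):

```python
from spacy.lang.en import English

# A bare Language subclass: tokenization rules only, no trained components
nlp = English()
doc = nlp("This works without downloading any pipeline.")
print(len(doc))
```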

After processing a text, words and punctuation are stored in the vocabulary object of nlp:

>>> type(nlp.vocab)
spacy.vocab.Vocab

This Vocab is shared between documents, meaning it accumulates new words from every doc processed by the same nlp object. The doc object exposes the same shared vocabulary through its own vocab attribute:

>>> type(doc.vocab)
spacy.vocab.Vocab

Internally, spaCy works with hashes to save memory, and it keeps a two-way lookup table called the StringStore. You can get the hash of a string, or get the string back from its hash:

>>> type(nlp.vocab.strings)
spacy.strings.StringStore
>>> nlp.vocab.strings["google"]
1988622737398120358
>>> nlp.vocab.strings[1988622737398120358]
'google'

When tokens go into the Vocab, they lose all their context-specific information. So, when you look up words from the vocab, you are looking up lexemes:

>>> lexeme = nlp.vocab["google"]
>>> type(lexeme)
spacy.lexeme.Lexeme
Lexemes don’t contain context-specific information like part-of-speech tags, morphological dependencies, etc. But they still offer many lexical attributes of the word:

>>> print(lexeme.text, lexeme.orth, lexeme.is_digit)
google 1988622737398120358 False

The orth attribute holds the hash of the lexeme.

So, if you are looking at a word through the doc object, it is a token. If it is from a Vocab, it is a lexeme.

Now, let’s talk more about the doc object.

Calling nlp object on text generates the doc along with its special attributes.

You can create docs manually by importing the Doc class from tokens module:

Doc requires three arguments – the vocabulary from nlp, a list of words, and a list of booleans specifying whether each word is followed by a space (including the last one). All tokens in a doc carry this information.
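Reconstructing that example as a runnable sketch (the words and spaces are chosen to reproduce the output shown just below):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["I", "love", "Barcelona", "!"]
spaces = [True, True, False, False]  # "!" attaches directly to "Barcelona"

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
```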

>>> len(doc)
4
>>> doc.text
'I love Barcelona!'

Spans are also a class of their own and expose a range of attributes, even though they are just a view of the doc object:

To create the span objects manually, pass the doc object and start/end indices of the tokens to the Span class:
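A sketch with a small hand-made doc:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I love Barcelona!")

# A span over tokens 1..2 ("love Barcelona"), with an optional label
span = Span(doc, 1, 3, label="EXAMPLE")
print(span.text, span.label_)
```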

Named Entity Recognition (NER)

One of the most common tasks in NLP is predicting named entities, like people, locations, countries, brands, etc.

Performing NER is ridiculously easy in spaCy. After processing a text, just extract the ents attribute of the doc object:

Cleopatra is recognized as a PERSON, while Egypt is a Geo-political entity (GPE). To know the meaning of other labels, you can use the explain function:

Instead of printing text, you can use spaCy’s visual entity tagger, available via displacy:

Image by author

The image shows that Alexander the Great isn’t recognized as a PERSON because it is not a common name. But no matter, we can label Alexander as a PERSON manually.

First, extract the full name as a span by giving it a label (PERSON):
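A sketch, assuming “Alexander the Great” makes up the first three tokens of the doc (a blank pipeline is enough to illustrate):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Alexander the Great was a king of ancient Macedon.")

# "Alexander the Great" spans tokens 0..2
alexander = Span(doc, 0, 3, label="PERSON")
print(alexander.text, alexander.label_)
```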

Then, update the ents list with the span:

doc.ents = list(doc.ents) + [alexander]

Now, displacy tags it as well:

Image by author

You could’ve set the new entity with set_ents function as well:

# Leaves the rest of ents untouched
doc.set_ents([alexander], default="unmodified")

Predicting part-of-speech (POS) tags and syntactic dependencies

spaCy also offers a rich selection of tools for grammatical analysis of a document. Grammatical and lexical properties of tokens are exposed as attributes.

For example, let’s take a look at each token’s part-of-speech tag and its syntactic dependency:

The output contains some confusing labels, but we can infer some of them like verbs, adverbs, adjectives, etc. Let’s see the explanation for a few others:

The last column in the previous table represents word relations like “the first”, “first footprints”, “remain there”, etc.

spaCy contains many more powerful features for linguistic analysis. As the last example, here is how you extract noun chunks:

Learn more about linguistic features from this page of the spaCy User Guide.

Custom rule-based tokenization

Until now, spaCy had complete control over tokenization rules. But a language can have many culture-specific idiosyncrasies and edge cases that don’t fit spaCy’s rules. For example, in an earlier example, we saw that “Alexander the Great” was missed as an entity.

If we process the same text again, spaCy regards the entity as three tokens. We need to tell it that titled names like Alexander the Great or Bran the Broken should be considered a single token rather than three because splitting them makes no sense.

Let’s see how to do that by creating custom tokenization rules.

We will start by creating a pattern as a dictionary:
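Here is what such a pattern can look like (the attribute keywords are spaCy’s; this particular pattern follows the description below):

```python
# Each dict describes one token; the list describes a sequence of three tokens
pattern = [
    {"IS_ALPHA": True},  # an alphabetic token, e.g. "Alexander"
    {"IS_STOP": True},   # a stop word, e.g. "the"
    {"IS_ALPHA": True},  # another alphabetic token, e.g. "Great"
]
```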

In spaCy, there is a large set of keywords you can combine to match virtually any type of token pattern. For example, the above pattern describes three tokens, with the first and last tokens being alphabetic text and the middle one being a stop word (like the, and, or, etc.). In other words, we are matching “Alexander the Great” without explicitly telling it to spaCy.

Now, we will create a Matcher object with this pattern:

After processing the text, we call this matcher object on the doc object, which returns a list of matches. Each match is a tuple with three elements – match ID, start, and end:
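Putting it all together in a runnable sketch (a blank English pipeline is enough, since the matcher only needs lexical attributes):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

pattern = [{"IS_ALPHA": True}, {"IS_STOP": True}, {"IS_ALPHA": True}]

matcher = Matcher(nlp.vocab)
matcher.add("TITLED_NAME", [pattern])  # the match ID name is my choice

doc = nlp("Alexander the Great founded many cities.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Note that this loose pattern also matches “founded many cities” (since “many” is a stop word); real patterns usually add more constraints, such as requiring title-cased tokens.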

You can tweak the pattern any way you want. For example, you can use quantifiers like OP with regex-style operators, or use regular expressions themselves:

You can learn more about custom rule-based matching from here.

Word vectors and semantic similarity

Another everyday use case of NLP is measuring semantic similarity. Similarity scores are useful in recommender systems, plagiarism detection, duplicate-content detection, etc.

spaCy calculates semantic similarity using word vectors (explained below), which are available in the medium-sized model:

All doc, token and span objects have this similarity method:

The three classes can be compared to each other as well, like a token to a span:

>>> doc1[0:2].similarity(doc[3])
0.8700238466262817

The similarity is calculated using word vectors, which are multi-dimensional mathematical representations of words. For example, here is the vector of the first token in the document:

All about pipelines

Under the hood, language models aren’t one pipeline but a collection:

When a text is processed with nlp, it is first tokenized and then passed down to each component in the list above, which, in turn, modifies and returns the Doc object with new information. Here is how it is illustrated in the spaCy docs:

Image from the spaCy docs

But spaCy can’t ship a ready-made pipeline component for every NLP problem you might face in the real world. For example, you may want to perform your own preprocessing steps before the text reaches the other components, or write callbacks that extract extra information between components.

For such cases, you should learn how to write custom pipeline functions and add them to the nlp object, so they are automatically run when you call nlp on text.

Here is a basic overview of how to do this:

You need the general Language class, and you register your function with its component decorator. The custom function must accept and return a single doc argument. As a first example, we will define a simple component that prints the length of the doc object.

Let’s add it to the nlp object:
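A runnable sketch (the component name length_component is my choice):

```python
import spacy
from spacy.language import Language

# Register a custom component under a name
@Language.component("length_component")
def length_component(doc):
    print(f"There are {len(doc)} tokens in this text.")
    return doc  # components must return the doc for the next component

nlp = spacy.blank("en")
nlp.add_pipe("length_component")  # appended to the end by default
print(nlp.pipe_names)
```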

As you can see, the custom pipeline is added to the end. Now, we will call nlp on a sample text:

>>> doc = nlp("Bird dies, but you remember the flight.")
There are 9 tokens in this text.

Working as expected.

Now, let’s do a more serious example. We will go back to the “Alexander the Great” example and write a pipeline component that adds the conqueror’s name as an entity. And we want all of this to happen automatically, without finding the entity outside the pipeline.

Here is the complete code:
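Since the article’s original listing isn’t shown, here is one way such a component can be written (the component and match names are my choice; filter_spans guards against overlapping matches):

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")

pattern = [{"IS_ALPHA": True}, {"IS_STOP": True}, {"IS_ALPHA": True}]
matcher = Matcher(nlp.vocab)
matcher.add("TITLED_NAME", [pattern])

@Language.component("titled_name_tagger")
def titled_name_tagger(doc):
    spans = [Span(doc, start, end, label="PERSON")
             for _, start, end in matcher(doc)]
    # Drop overlapping matches before assigning them as entities
    doc.ents = filter_spans(spans)
    return doc

nlp.add_pipe("titled_name_tagger")

doc = nlp("Alexander the Great founded many cities.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

As with the earlier matcher example, the loose pattern can tag extra phrases too; tightening the pattern is left as an exercise.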

We define the pattern, create a matcher object, add the matches to the entities, and return the doc. Let’s test it:

It is working as expected.

So far, we have been appending custom pipelines to the end, but we can control this behavior. The add_pipe function has arguments to specify precisely where you want to insert the function:
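A sketch (the custom component here is a do-nothing placeholder, and sentencizer is one of spaCy’s built-in rule-based components):

```python
import spacy
from spacy.language import Language

@Language.component("first_component")
def first_component(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")                  # appended to the end
nlp.add_pipe("first_component", first=True)  # inserted at the beginning
# Other options: last=True, before="name", after="name"
print(nlp.pipe_names)
```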


Today, you have taken bold steps towards mastering the Scikit-learn of natural language processing. Armed with the article’s knowledge, you can now roam freely across spaCy’s user guide, which is just as large and information-rich as Scikit-learn’s.

Thank you for reading!

