Natural Language Processing


Stemming and Lemmatization in NLP with Python

Techniques in natural language processing to analyze text

Photo by Micah Boswell on Unsplash

In this article, we will study stemming and lemmatization, two natural language processing techniques. Both are important for pre-processing text data and refining it for analysis.

  • Stemming: A process that strips suffixes from words to reduce them to a common stem. It is an important step in an NLP pipeline. Words like ‘happiness’, ‘happiest’, and ‘happier’ all share the root ‘happy’; a stemmer approximates that root with rules, so the stem it returns is not always a dictionary word.
  • Lemmatization: A process that also reduces a word to its base form (the lemma), but it returns a dictionary word and can take the part of speech (POS) into account.

Both techniques are used in text mining to extract the most useful information from text.

Stemming is simple and fast because it relies mainly on pattern-based rules. Lemmatization, by contrast, uses more detailed information about each word, such as its part of speech.
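To make the pattern-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. The rule list and the `toy_stem` function are hypothetical names invented for this illustration; this is not the Porter algorithm, which uses many more rules and conditions.

```python
# Toy suffix-stripping stemmer: an illustration of the pattern-based idea only.
# The rules below are hypothetical and far simpler than the Porter algorithm.
SUFFIX_RULES = [
    ("iness", "y"),  # happiness -> happy
    ("iest", "y"),   # happiest  -> happy
    ("ier", "y"),    # happier   -> happy
    ("ing", ""),     # resting   -> rest
    ("s", ""),       # crops     -> crop
]


def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: len(lowered) - len(suffix)] + replacement
    return lowered  # no rule matched, so the word is returned unchanged


for w in ["happiness", "happiest", "happier", "resting", "crops", "among"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at the spelling, they can over-strip: with this rule list, ‘this’ becomes ‘thi’. Real stemmers have the same kind of limitation, only less often.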

User-friendly Python libraries for NLP tasks include spaCy, NLTK, and TextBlob.

spaCy does not include a stemmer, so we need another library for stemming. We will use NLTK for both methods.

NLTK provides two main stemmers: the Porter stemmer and the Snowball stemmer. The Snowball stemmer is an improved version of the Porter stemmer that generally produces more precise stems.
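As a quick sketch of how the two stemmers can be compared side by side, the snippet below runs both on the same words. `compare_stemmers` is a helper name made up for this example; `PorterStemmer` and `SnowballStemmer` are the NLTK classes, and neither needs any downloaded data.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name


def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) tuples for each input word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]


# "fairly" is a commonly cited case where the two stemmers disagree
for word, p, s in compare_stemmers(["fairly", "crops", "resting"]):
    print("%-10s Porter: %-10s Snowball: %s" % (word, p, s))
```

For ‘fairly’, Porter returns ‘fairli’ while Snowball returns ‘fair’. On most everyday words, the two agree.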

Let’s look at Python examples of stemming and lemmatization with the NLTK library.

Stemming

In this section, we will use the Porter stemmer. First, we import the NLTK library and download ‘punkt’, the tokenizer data that NLTK uses to split text into tokens. Then we import the Porter stemmer and create an instance to stem the words of a sentence. Stemming maps the different variations of a word to one stem while keeping roughly the same meaning.

import nltk
nltk.download('punkt')  # tokenizer data; newer NLTK releases may need 'punkt_tab' instead
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

Now we will take a sentence as an example for stemming.

text = "Fruits are one of the essential crops among all, which have all nutrients"
# Grab the tokens from the sentence
tokens = nltk.word_tokenize(text)
# Stem each token and print it next to the original word
for w in tokens:
    print("Actual: %s ---- Stem: %s" % (w, porter_stemmer.stem(w)))

This code stems every token stored in the tokens variable.

#output:
Actual: Fruits ---- Stem: fruit
Actual: are ---- Stem: are
Actual: one ---- Stem: one
Actual: of ---- Stem: of
Actual: the ---- Stem: the
Actual: essential ---- Stem: essenti
Actual: crops ---- Stem: crop
Actual: among ---- Stem: among
Actual: all ---- Stem: all
Actual: , ---- Stem: ,
Actual: which ---- Stem: which
Actual: have ---- Stem: have
Actual: all ---- Stem: all
Actual: nutrients ---- Stem: nutrient

Another example of the Porter stemmer is shown below:

from nltk.stem import PorterStemmer
words = ["rest", "resting", "rests", "restful"]
ps = PorterStemmer()
# Every variation should reduce to the same stem
for w in words:
    rootWord = ps.stem(w)
    print(rootWord)
#output:
rest
rest
rest
rest
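One practical use of this behaviour is grouping word variations under their shared stem. The sketch below assumes the same NLTK `PorterStemmer`; `group_by_stem` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for w in words:
        groups[stemmer.stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["rest", "resting", "rests", "restful", "crops", "crop"])
for stem, members in groups.items():
    print(stem, "<-", members)
```

All four ‘rest’ variations end up in one group, which is what makes stemming useful for counting or searching words regardless of their suffix.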

Lemmatization

Using the same example text, we will now apply lemmatization and note how the results differ between the two methods.

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
text = "Fruits are one of the essential crops among all, which have all nutrients"
tokens = nltk.word_tokenize(text)
# Lemmatize each token (treated as a noun by default)
for w in tokens:
    print("Actual: %s --- Lemma: %s" % (w, lemma.lemmatize(w)))

The output of the lemmatization is shown below:

#output:
Actual: Fruits --- Lemma: Fruits
Actual: are --- Lemma: are
Actual: one --- Lemma: one
Actual: of --- Lemma: of
Actual: the --- Lemma: the
Actual: essential --- Lemma: essential
Actual: crops --- Lemma: crop
Actual: among --- Lemma: among
Actual: all --- Lemma: all
Actual: , --- Lemma: ,
Actual: which --- Lemma: which
Actual: have --- Lemma: have
Actual: all --- Lemma: all
Actual: nutrients --- Lemma: nutrient

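Comparing the two outputs, we can see that the methods differ on two tokens: the stemmer returns ‘fruit’ and ‘essenti’, while the lemmatizer returns ‘Fruits’ and ‘essential’. The lemmatizer leaves ‘Fruits’ unchanged because its dictionary lookup is case-sensitive.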
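The comparison can also be done programmatically. The sketch below copies the (token, stem, lemma) values from the two outputs above into a list and keeps only the rows where the stem and the lemma disagree; `differing` is a helper name invented for this example.

```python
# (token, stem, lemma) triples copied from the stemming and lemmatization outputs above
results = [
    ("Fruits", "fruit", "Fruits"),
    ("are", "are", "are"),
    ("one", "one", "one"),
    ("of", "of", "of"),
    ("the", "the", "the"),
    ("essential", "essenti", "essential"),
    ("crops", "crop", "crop"),
    ("among", "among", "among"),
    ("all", "all", "all"),
    (",", ",", ","),
    ("which", "which", "which"),
    ("have", "have", "have"),
    ("all", "all", "all"),
    ("nutrients", "nutrient", "nutrient"),
]


def differing(rows):
    """Keep only the rows where the stem and the lemma are not the same string."""
    return [(token, stem, lemma) for token, stem, lemma in rows if stem != lemma]


for token, stem, lemma in differing(results):
    print("%s: stem=%s, lemma=%s" % (token, stem, lemma))
```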

Lemmatization is often preferred in business applications because it is more reliable and more sophisticated than stemming.

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

print("days :", lem.lemmatize("days"))
print("bottles :", lem.lemmatize("bottles"))

# pos="a" tells the lemmatizer to treat the word as an adjective
print("happier :", lem.lemmatize("happier", pos="a"))
#output:
days : day
bottles : bottle
happier : happy
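In practice, the part of speech usually comes from a tagger rather than being typed by hand. Taggers such as `nltk.pos_tag` return Penn Treebank tags (for example ‘JJR’ or ‘NNS’), while `lemmatize` expects the one-letter WordNet codes. The sketch below shows one simple way to bridge the two; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and the first-letter mapping is a rough heuristic rather than an official NLTK conversion.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'JJR', 'VBD') to a one-letter WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, which is also the default used by lemmatize()


def lemmatize_tagged(tagged_tokens):
    """Lemmatize (word, penn_tag) pairs; needs NLTK and its 'wordnet' data."""
    # Imported here so that penn_to_wordnet can be used without NLTK installed
    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    return [lem.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-written tags for illustration; a tagger could supply them instead
for tag in ["JJR", "VBD", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With the wordnet data downloaded, calling `lemmatize_tagged([("happier", "JJR"), ("days", "NNS")])` should reproduce the ‘happy’ and ‘day’ results shown above.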

Conclusion

This article gave a basic introduction to two methods in natural language processing: stemming and lemmatization. NLTK is a wonderful library for analyzing text data.

I hope you liked the article. Reach me on LinkedIn and Twitter.

