Preprocessing Textual Data
Using Cleantext for cleaning text dataset
If you’ve ever worked with textual datasets, you know the noise that comes with raw text. To clean this data, we apply certain preprocessing steps that help in cleaning and manipulating it. Preprocessing is an important step because it ensures that well-formed data is passed to the model, so the model can work as intended.
There are several Python libraries that help with preprocessing text datasets. One such library is Cleantext, an open-source Python module used to clean and preprocess text data into a normalized representation.
In this article, we will explore Cleantext and its different functionalities.
Let’s get started…
Installing required libraries
We will start by installing the Cleantext library using pip. The command given below will do that.
!pip install cleantext
Importing required libraries
In this step, we will import the required libraries for cleaning and preprocessing the dataset. Cleantext requires NLTK at the backend, so we will import NLTK as well.
Preprocessing the data
Now we will clean the data using Cleantext. We will explore both options: cleaning a sentence and cleaning a text file.
cleantext.clean('Himanshu+-= S$harma WelC@omes!!! you to 123medium', extra_spaces=True, lowercase=True, numbers=True, punct=True)
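For intuition, here is a rough stdlib-only sketch of what those four flags do; this is my own approximation for illustration, not the library's actual implementation:

```python
import re
import string

def clean_sketch(text):
    text = text.lower()                                               # lowercase=True
    text = re.sub(r"\d+", "", text)                                   # numbers=True
    text = text.translate(str.maketrans("", "", string.punctuation))  # punct=True
    text = re.sub(r"\s+", " ", text).strip()                          # extra_spaces=True
    return text

print(clean_sketch('Himanshu+-= S$harma WelC@omes!!! you to 123medium'))
# → himanshu sharma welcomes you to medium
```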
Now let us see how we can clean a text file. For this, we will import a text file and read it to perform preprocessing.
with open("/content/data.txt", "rt") as file:
    text = file.read()
cleantext.clean(text, all=True)
Similarly, we can clean the individual words in a sentence. The clean_words function takes several parameters that control which operations are applied, as you will see in the code below.
cleantext.clean_words('Himanshu+-= S$harma WelC@omes!!! you to 123medium',
    all=False,          # Don't run every operation; enable them individually below
    extra_spaces=True,  # Remove extra white space
    stemming=True,      # Stem the words
    stopwords=True,     # Remove stop words
    lowercase=True,     # Convert to lowercase
    numbers=True,       # Remove all digits
    punct=True,         # Remove all punctuation
    stp_lang='english'  # Language for stop words
)
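To see what word-level cleaning with stop-word removal amounts to, here is a stdlib-only sketch; the hand-picked STOPWORDS set is my own simplification, whereas the library uses NLTK's English stop-word corpus:

```python
import re
import string

# Tiny illustrative stop-word list; cleantext itself draws on
# NLTK's full English stop-word corpus (stp_lang='english').
STOPWORDS = {"you", "to", "the", "a", "is"}

def clean_words_sketch(text):
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]

print(clean_words_sketch('Himanshu+-= S$harma WelC@omes!!! you to 123medium'))
# → ['himanshu', 'sharma', 'welcomes', 'medium']
```

Note that this sketch skips stemming, which would additionally require a stemmer such as NLTK's PorterStemmer.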
If we want to enable all of these operations at once, we can set the all parameter to True, as in the code below.
cleantext.clean_words('Himanshu+-= S$harma WelC@omes!!! you to 123medium', all=True)
Here you can see how we cleaned text, sentences, and words using Cleantext. This can be helpful when building an NLP model, because training on cleaned text not only improves performance but also helps achieve higher accuracy.
Go ahead and try this with different datasets, performing the preprocessing with Cleantext. In case you face any difficulty, please let me know in the responses section.
This article is in collaboration with Piyush Ingale.
Before You Go
Thanks for reading! If you want to get in touch with me, feel free to reach me at firstname.lastname@example.org or my LinkedIn Profile. You can view my Github profile for different data science projects and packages tutorials. Also, feel free to explore my profile and read different articles I have written related to Data Science.