Natural Language Processing


Building Bag of Words (BOW) Model from Scratch in NLP

Text representation model in natural language processing


In this article, we will discuss building a bag of words (BOW) model in natural language processing. Sometimes we only want to find how often a single word occurs in a text document, and a simple count is enough. But when we want the occurrence of every word in the document, each with its count, we use the bag of words method, one of the simplest text representation models (not to be confused with dense word embeddings, which are a different kind of representation).

The bag of words model extracts information from the text and turns it into a dictionary, or histogram, of words with their counts.

Let’s have an example with this sentence below:

Sentence 1: “Thank you so much for your help.”
Sentence 2: “Your most welcome.”

To prepare these sentences for further analysis with algorithms, we first break them into tokens and remove all punctuation and symbols.

Building a matrix of word counts takes a few simple steps:

Step 1: Make tokens from the sentences
Step 2: Lowercase all words
Step 3: Make a matrix or dictionary with word counts
Applying these steps to sentences 1 and 2 gives the following word counts:

{'thank': 1, 'you': 1, 'so': 1, 'much': 1, 'for': 1, 'your': 2, 'help': 1, 'most': 1, 'welcome': 1}

Each word is listed with the number of times it occurs across the two sentences; 'your' gets a count of 2 because it appears once in each sentence.
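
These three steps can be sketched in a few lines of plain Python. The snippet below is a minimal illustration using only the standard library's re and collections.Counter; the NLTK-based version we build in the rest of the article follows the same logic.

import re
from collections import Counter

sentences = ["Thank you so much for your help.", "Your most welcome."]

word_counts = Counter()
for sentence in sentences:
    # steps 1 and 2: lowercase the sentence and split it into word tokens
    tokens = re.findall(r'\w+', sentence.lower())
    # step 3: accumulate the counts in a dictionary-like Counter
    word_counts.update(tokens)

print(word_counts)
#output:
Counter({'your': 2, 'thank': 1, 'you': 1, 'so': 1, 'much': 1, 'for': 1, 'help': 1, 'most': 1, 'welcome': 1})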

Now, we will build a bag of words from scratch in practice.

First, install the NLTK library, then import it along with the other libraries used in natural language processing.

import nltk
import re

# the NLTK tokenizers rely on the punkt model; download it once
nltk.download('punkt')

Now, we will define the example text data.

text = """Machine learning and robotic vision system combination
opens a gateway for precision agriculture and enhances the
quality of fruit harvesting applications. Study describes
various challenges in the fruit agricultural process like
weather, illumination variation and occlusion, etc..
Currently fruit harvesting processes are manual with semi-
advanced machinery. Many approaches use images that
contain fruits to make more advanced vision system for
yield estimation analysis."""

It’s time to make tokens from the text.

token_data = nltk.sent_tokenize(text)
token_data
#output:
['Machine learning and robotic vision system combination opens a gateway for precision \n agriculture and enhances the quality of fruit harvesting applications. Study describes \n various challenges in the fruit agricultural process like weather, illumination variation \n and occlusion, etc..',
'Currently fruit harvesting processes are manual with semi-advanced \n machinery.',
'Many approaches use images that contain fruits to make more advanced vision system \n for yield estimation analysis.']

In the output, it is clear that there are still punctuation marks and other impurities, and we need to remove them.

We can also check the number of tokens.

len(token_data)
#output:
3

This means the raw data contains 3 sentences.

To remove them, we use the regular expression (re) library that we imported at the start.

# to remove the extra non-word characters
for i in range(len(token_data)):
    token_data[i] = token_data[i].lower()
    token_data[i] = re.sub(r'\W', ' ', token_data[i])
    token_data[i] = re.sub(r'\s+', ' ', token_data[i])
token_data
#output:
['machine learning and robotic vision system combination opens a gateway for precision agriculture and enhances the quality of fruit harvesting applications study describes various challenges in the fruit agricultural process like weather illumination variation and occlusion etc ',
 'currently fruit harvesting processes are manual with semi advanced machinery ',
 'many approaches use images that contain fruits to make more advanced vision system for yield estimation analysis ']

Now that the data is cleaner, we can make the histogram, or dictionary, of these words.

# creating the histogram or dictionary of word counts
word2count = {}
for data in token_data:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
word2count
#output:
{'machine': 1,
'learning': 1,
'and': 3,
'robotic': 1,
'vision': 2,
'system': 2,
'combination': 1,
'opens': 1,
'a': 1,
...
...
'makemore': 1,
'yield': 1,
'estimation': 1,
'analysis': 1}

As we can see, this is a small example. But if we have to do text classification and build a model on thousands of words, we have to extract the most frequent words out of all of them.

The total number of unique words is shown below:

len(word2count)
#output:
52

To extract the 20 most frequent words out of these 52 words, we use the heapq library.

import heapq
freq_words = heapq.nlargest(20, word2count, key=word2count.get)
freq_words
#output:
['and', 'fruit', 'vision', 'system', 'for', 'the', 'harvesting', 'advanced', 'machine', 'learning', 'robotic', 'combination', 'opens', 'a', 'gateway', 'precision', 'agriculture', 'enhances', 'quality', 'of']

Now we have the list of the top 20 most frequent words in the text document.

We will now match freq_words against each sentence in token_data to make the bag of words model: each sentence becomes a vector with a 1 for every frequent word it contains and a 0 for every one it lacks.

bow = []
for data in token_data:
    # tokenize the sentence once, then mark each frequent word as present or absent
    words = nltk.word_tokenize(data)
    vector = []
    for word in freq_words:
        if word in words:
            vector.append(1)
        else:
            vector.append(0)
    bow.append(vector)
bow
#output:
[[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
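
Note that this model records only the presence (1) or absence (0) of each frequent word, i.e. a binary bag of words. A count-based variant is a small change; below is a sketch that assumes the same token_data and freq_words from above:

# count-based bag of words: store how often each frequent word occurs
bow_counts = []
for data in token_data:
    words = nltk.word_tokenize(data)
    bow_counts.append([words.count(word) for word in freq_words])

For sentence 1, for example, the first entry would be 3 instead of 1, because 'and' occurs three times in that sentence.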

The bag of words model is a list of lists, but we need to convert it into a NumPy array (matrix) for our algorithms.

import numpy as np
bow = np.asarray(bow)
bow
#output:
array([[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

This output is a two-dimensional array of shape (3, 20), one row per sentence and one column per frequent word, and it is ready for further modeling.
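
For comparison, scikit-learn provides a ready-made implementation of the same idea. Below is a minimal sketch assuming scikit-learn is installed; note that CountVectorizer uses its own tokenizer (it drops single-character tokens like 'a' by default), so its vocabulary may differ slightly from ours.

from sklearn.feature_extraction.text import CountVectorizer

# max_features=20 keeps only the 20 most frequent words, like our heapq step;
# binary=True reproduces the presence/absence vectors we built by hand
vectorizer = CountVectorizer(max_features=20, binary=True)
bow_sklearn = vectorizer.fit_transform(token_data)
print(vectorizer.get_feature_names_out())
print(bow_sklearn.toarray())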
