Topic Modelling on NYT articles using Gensim, LDA

3. Implementing LDA

The dataset used in this case study is the New York Times article data from the year 2020. This dataset contains around 69k data points.

A snapshot of the data is presented in the image below.

Snapshot of article data (Image by author)

The column ‘year’ can be ignored, as we are using the data from the year 2020 alone. The column ‘sentence’ holds the article text at the sentence level.

We will first preprocess the sentence column and then apply the LDA model from Gensim on this preprocessed data.

a. Preprocess the data

Preprocessing text data is not a single-step process: raw text contains redundant and repetitive words, so it has to go through several rounds of cleaning. The exact steps also depend on the objective of the analysis.

This phase involves the deletion of words or characters that do not add value to the meaning of the text. Let’s discuss each step below:

  • Lowering the case of text is essential for the following reason:
    The words ‘elections’, ‘Elections’, and ‘ELECTIONS’ all carry the same meaning in a sentence. Lowercasing all the words reduces the dimensionality of the data by shrinking the vocabulary.
  • Removing any punctuation marks will help to treat words like ‘hurray’ and ‘hurray!’ in the same way.
  • Stopwords are commonly occurring words in a language, such as ‘the’, ‘a’, ‘an’, ‘is’. We can remove them here because they won’t provide any valuable information for our analysis. Removing stopwords also reduces the dimensionality of the data.
  • In this case study, as we are looking for keywords that make a topic, numbers wouldn’t add any value. Hence, we are removing numbers from the data.
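
To make the four steps above concrete, here is a minimal pure-Python sketch. The stopword set and the `clean_sentence` helper are hypothetical and exist only for illustration; a real stopword list is much longer.

```python
import string

# Hypothetical, deliberately tiny stopword set (illustration only).
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "were"}


def clean_sentence(text):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    text = text.lower()
    # Replace every ASCII punctuation character and digit with a space.
    drop = string.punctuation + string.digits
    text = text.translate(str.maketrans(drop, " " * len(drop)))
    # Split on whitespace and keep only non-stopword tokens.
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_sentence("Hurray! The ELECTIONS were held in 2020.")
print(tokens)  # ['hurray', 'elections', 'held']
```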

We can perform all the above using Gensim. Gensim provides a function, preprocess_string, which applies the most widely used preprocessing techniques to text data. By default, it first lowercases the text and then applies the following filters in order:

  1. strip_tags(),
  2. strip_punctuation(),
  3. strip_multiple_whitespaces(),
  4. strip_numeric(),
  5. remove_stopwords(),
  6. strip_short(),
  7. stem_text()

A single line of code is enough to preprocess the entire text data with the default filters.


b. Creation of dictionary and corpus

Let us create a dictionary and a bag-of-words corpus to pass as input to the model.

  • dictionary: The collection of all the unique words, each mapped to an integer id
  • bow_corpus: Each document (here, each sentence) is converted into a bag of words. Bag of Words is a simple transformation of a document into a vector, using the dictionary of unique words, that records the frequency of each word in that document.


Snapshot of dictionary and bow_corpus (Image by author)

c. LDA model

We pass the dictionary and bow_corpus to the LDA model provided by Gensim, which does the topic modelling of each sentence.

Below is the line of code used to perform topic modelling using LDA.


d. Visualizing the results

There are many ways to visualize the results, such as t-SNE, word clouds, and bar charts. We will use pyLDAvis to visualize our LDA model, as it provides an interactive view of the topics and their most relevant terms.

Below is the code used to produce the visualization of the results from the LDA model.


