Natural Language Processing for Judicial Sentences with Python



Cover image source: https://pixabay.com/

In this article, I’m going to perform Latent Semantic (or Topic) Analysis, a technique that analyzes the relationships between documents and the terms they contain by producing new “concepts” (also called semantics or topics) that relate documents and terms. This technique is founded on the distributional hypothesis, according to which “linguistic terms with similar distributions have similar meanings”. As a consequence, terms that are close in meaning are likely to occur in similar pieces of text.

Latent Semantic Analysis consists of two main steps:

  1. Creating a Word-Document Matrix (you can read more about document vectorization here; a toy sketch follows this list)
  2. Reducing the dimensionality of that matrix in order to produce new variables (that will be our semantics or topics).
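
To make step 1 concrete, here is a minimal toy sketch (the two example sentences are made up for illustration) of how a corpus becomes a word-document matrix:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the court upheld the ruling", "the jury reached a verdict"]
vec = CountVectorizer()
matrix = vec.fit_transform(toy_corpus)

print(vec.get_feature_names_out())  # the vocabulary (one column per term)
print(matrix.toarray())             # one row per document, counts per term

In the real analysis below I will use TF-IDF weights rather than raw counts, but the shape of the matrix is the same.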

I will perform two matrix decomposition techniques: Singular Value Decomposition and Non-Negative Matrix Factorization. They both rely on the assumption that, in a high-dimensional feature space, a smaller number of dimensions may actually be enough to explain the variation in the data (you can read more about the concept of dimensionality reduction here →https://towardsdatascience.com/pca-eigenvectors-and-eigenvalues-1f968bc6777a).

The main difference between the two approaches is that SVD decomposes the vectorized document matrix X into three lower-dimensional matrices, while NMF does so with only two (non-negative) matrices.
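
Concretely, if X is the n×m document-term matrix (n documents, m terms) and k is the number of latent topics, the two factorizations can be written as:

X ≈ U Σ Vᵀ, with U of shape n×k, Σ a diagonal k×k matrix of singular values, and Vᵀ of shape k×m (SVD)

X ≈ W H, with W of shape n×k, H of shape k×m, and all entries of W and H non-negative (NMF)

In both cases, the rows of Vᵀ (or H) describe each topic in terms of the vocabulary, while the rows of UΣ (or W) describe each document in terms of the topics.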

Let’s implement them in Python.

2-Dimensions Analysis

#create the original matrix X (term-document matrix), vectorized with tf-idf weights.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = df_factor.Tokens.apply(str).tolist()
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', analyzer='word',
                                   min_df=0.001, max_df=0.5, sublinear_tf=True, use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

from sklearn.decomposition import TruncatedSVD #svd

# set number of latent components
k = 10

svd = TruncatedSVD(n_components=k)
%time U = svd.fit_transform(X)  # returns X projected onto the components, i.e. U·Σ
S = svd.singular_values_
V = svd.components_

from sklearn.decomposition import NMF #nmf

nmf = NMF(n_components=k, init='nndsvd', random_state=0)

%time W = nmf.fit_transform(X)
H = nmf.components_

print("SVD matrices shapes: ", U.shape, S.shape, V.shape)
print("NMF matrices shapes: ",W.shape, H.shape)

Wall time: 2.5 s
Wall time: 11.4 s
SVD matrices shapes: (13087, 10) (10,) (10, 40808)
NMF matrices shapes: (13087, 10) (10, 40808)
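
As a quick sanity check (a minimal sketch, assuming the svd and nmf objects fitted above), we can inspect how much variance the ten SVD components retain and the reconstruction error stored by NMF:

# share of the variance captured by the k SVD components
print("SVD explained variance: %.3f" % svd.explained_variance_ratio_.sum())

# Frobenius norm of X - WH, computed by scikit-learn during fitting
print("NMF reconstruction error: %.3f" % nmf.reconstruction_err_)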

import numpy as np

def show_topics(A, vocabulary, topn=5):
    """
    Find the top N words for each of the latent dimensions (= rows) in A.
    """
    topic_words = ([[vocabulary[i] for i in np.argsort(t)[:-topn - 1:-1]]
                    for t in A])
    return [', '.join(t) for t in topic_words]

Now, let’s print the top terms for each latent topic, as weighted by TF-IDF scores (you can read more about this score in the previous part of this series).

#SVD
terms = tfidf_vectorizer.get_feature_names()  # on scikit-learn >= 1.2, use get_feature_names_out()

sorted(show_topics(V, terms))
['antitrust, antitrust division, bid, rigging, bid rigging',
'child, criminal division, safe childhood, project safe, childhood',
'child, safe childhood, project safe, childhood, exploitation',
'epa, environmental, clean, environment, natural',
'injunction, customers, complaint, preparing, preparers',
'medicare, hhs, health, health care, care',
'osc, ina, immigration, citizenship, discrimination provision',
'rights, civil rights, rights division, discrimination, employment',
'tax, fraud, false, prison, irs',
'tax, returns, irs, tax returns, tax division']
#NMF
sorted(show_topics(H, terms))
['antitrust, antitrust division, bid, rigging, bid rigging',
'child, safe childhood, project safe, childhood, exploitation',
'epa, environmental, clean, environment, natural',
'false claims, claims act, claims, civil division, health',
'fbi, indictment, police, security, law',
'medicare, hhs, health, health care, care',
'osc, ina, employment, citizenship, anti discrimination',
'rights, civil rights, rights division, civil, discrimination',
'tax, irs, tax division, returns, tax returns',
'tax, returns, customers, injunction, tax returns']
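
Since each row of W (and, analogously, of U) holds a document’s weights over the k latent topics, we can also assign every press release to its dominant topic and see how the corpus distributes across topics. A minimal sketch, reusing the W matrix fitted above:

import numpy as np

# index of the strongest latent topic for each document
dominant_topic = np.argmax(W, axis=1)

# number of documents assigned to each of the k topics
topic_counts = np.bincount(dominant_topic, minlength=k)
for topic_idx, count in enumerate(topic_counts):
    print(f"topic {topic_idx}: {count} documents")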

Now let’s plot our lower-dimensional document representations.

#Initializing a plotting function

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
from matplotlib import colors
import seaborn as sns

sns.set_context('notebook')

def plot_vectors(vectors, model, title='VIZ', labels=None, dimensions=3):
    """
    Plot the vectors in 2 or 3 dimensions.
    If labels are supplied, use them to color the data accordingly.
    """
    # set up graph
    fig = plt.figure(figsize=(10, 10))

    # create data frame
    df = pd.DataFrame(data={'x': vectors[:, 0], 'y': vectors[:, 1]})
    # add labels, if supplied
    if labels is not None:
        df['label'] = labels
    else:
        df['label'] = [''] * len(df)

    # assign colors to labels
    cm = plt.get_cmap('tab20b')  # choose the color palette
    n_labels = len(df.label.unique())
    label_colors = [cm(1. * i / n_labels) for i in range(n_labels)]
    cMap = colors.ListedColormap(label_colors)

    # plot in 3 dimensions
    if dimensions == 3:
        # add z-axis information
        df['z'] = vectors[:, 2]
        # define plot
        ax = fig.add_subplot(111, projection='3d')
        frame1 = plt.gca()
        # remove axis ticks
        frame1.axes.xaxis.set_ticklabels([])
        frame1.axes.yaxis.set_ticklabels([])
        frame1.axes.zaxis.set_ticklabels([])

        # plot each label as a scatter plot in its own color
        for l, label in enumerate(df.label.unique()):
            df2 = df[df.label == label]
            color_values = [label_colors[l]] * len(df2)
            ax.scatter(df2['x'], df2['y'], df2['z'],
                       c=color_values,
                       cmap=cMap,
                       edgecolor=None,
                       label=label,
                       alpha=0.4,
                       s=100)

        # name the axes after the latent topics
        topics = sorted(show_topics(model.components_, tfidf_vectorizer.get_feature_names()))
        print(topics)
        frame1.axes.set_xlabel(topics[0])
        frame1.axes.set_ylabel(topics[1])
        frame1.axes.set_zlabel(topics[2])

    # plot in 2 dimensions
    elif dimensions == 2:
        ax = fig.add_subplot(111)
        frame1 = plt.gca()
        frame1.axes.xaxis.set_ticklabels([])
        frame1.axes.yaxis.set_ticklabels([])

        for l, label in enumerate(df.label.unique()):
            df2 = df[df.label == label]
            color_values = [label_colors[l]] * len(df2)
            ax.scatter(df2['x'], df2['y'],
                       c=color_values,
                       cmap=cMap,
                       edgecolor=None,
                       label=label,
                       alpha=0.4,
                       s=100)

        topics = sorted(show_topics(model.components_, tfidf_vectorizer.get_feature_names()))
        print(topics)
        frame1.axes.set_xlabel(topics[0])
        frame1.axes.set_ylabel(topics[1])

    else:
        raise NotImplementedError()

    # place a single legend outside the plot area
    ax.legend(frameon=True, ncol=2, fancybox=True, title_fontsize=15,
              loc='center left', bbox_to_anchor=(1, 0.5), labelspacing=2.5, borderpad=2)
    plt.title(title)
    plt.show()

# now let's perform the same computations with 2 and 3 dimensions, so that we can visualize them. Let's start with 2 dims.

low_dim_svd = TruncatedSVD(n_components=2)
low_dim_U = low_dim_svd.fit_transform(X)
sorted(show_topics(low_dim_svd.components_, tfidf_vectorizer.get_feature_names()))

['medicare, hhs, health, health care, care', 'tax, fraud, false, prison, irs']

low_dim_nmf = NMF(n_components=2, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
sorted(show_topics(low_dim_nmf.components_, tfidf_vectorizer.get_feature_names()))

['medicare, health, health care, care, hhs',
 'tax, irs, returns, prison, tax division']

The results seem consistent between the two methods. Let’s have a look at the plots:

plot_vectors(low_dim_U, low_dim_svd, title = 'SVD 2d', dimensions=2)
plot_vectors(low_dim_W, low_dim_nmf, title = 'NMF 2d', dimensions=2)

Now I want to perform the same analysis, but also taking into account the labels (or categories) of the articles.

#creating a df with records that have only one label
data_single = df_factor.copy()[df_factor[categories].sum(axis=1) == 1]

documents = data_single.text.apply(str).tolist()
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', analyzer='word',
                                   min_df=0.001, max_df=0.5, sublinear_tf=True, use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

from sklearn.decomposition import TruncatedSVD # this also works with sparse matrices
labels = [i[0] for i in data_single.category] #each component is a list, I want a list of elements not a list of lists
# set number of latent components
k = 10

svd = TruncatedSVD(n_components=k)
%time U = svd.fit_transform(X)
S = svd.singular_values_
V = svd.components_

from sklearn.decomposition import NMF

nmf = NMF(n_components=k, init='nndsvd', random_state=0)

%time W = nmf.fit_transform(X)
H = nmf.components_
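
Before going further, a quick way to see whether the latent topics line up with the article categories is to cross-tabulate each document’s label against its dominant topic. A minimal sketch, reusing the labels list and the W matrix just fitted:

import numpy as np
import pandas as pd

# cross-tabulation of categories vs. dominant latent topics
dominant_topic = np.argmax(W, axis=1)
print(pd.crosstab(pd.Series(labels, name='category'),
                  pd.Series(dominant_topic, name='topic')))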

For this analysis, I will rely only on the NMF method: indeed, I noticed that the two topics extracted with SVD are very similar to each other:

low_dim_svd = TruncatedSVD(n_components = 2)
low_dim_U = low_dim_svd.fit_transform(X)
sorted(show_topics(low_dim_svd.components_, tfidf_vectorizer.get_feature_names()))

['tax, irs, tax division, fraud, returns',
'tax, returns, tax division, irs, tax returns']

In order for the latent topics to cover as much information as possible, I don’t want two or more components to carry the same piece of information and thus be redundant.
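
One way to quantify this redundancy is to compute the cosine similarity between the component (topic) vectors: values close to 1 mean two topics carry nearly the same information. A minimal sketch, assuming the low_dim_svd model fitted above:

from sklearn.metrics.pairwise import cosine_similarity

# pairwise similarity between the two SVD topic vectors (the rows of components_)
print(cosine_similarity(low_dim_svd.components_))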

low_dim_nmf = NMF(n_components=2, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
plot_vectors(low_dim_W, low_dim_nmf, labels = labels, title = 'NMF 2d', dimensions=2)

3-Dimensions Analysis

Now let’s do the same with 3 dimensions. In this case too, I will rely only on the NMF method: looking at the second and third latent topics (or components) extracted by SVD, I noticed that they are very similar:

['medicare, hhs, health, health care, care',
'tax, fraud, false, prison, irs',
'tax, returns, irs, tax returns, tax division']

Hence, for the same reasons explained above, I will not proceed further with the SVD method.

So let’s proceed with creating the 3D matrix and plotting the results:

low_dim_nmf = NMF(n_components=3, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)

plot_vectors(low_dim_W, low_dim_nmf, title = 'NMF 3d', dimensions=3)

Also in the 3D analysis, I want to enrich the plots by taking into account the labels (or categories) of the articles.


low_dim_nmf = NMF(n_components=3, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
sorted(show_topics(low_dim_nmf.components_, tfidf_vectorizer.get_feature_names()))
plot_vectors(low_dim_W, low_dim_nmf, labels = labels, title = 'NMF 3d', dimensions=3)

Conclusions

From the analyses above, it seems that both methods were able to cluster words belonging to documents with the same label into the same latent topic. This could be helpful for search engines or for the automated categorization of judicial sentences and, more generally, to ease navigation across the judicial knowledge base.
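
As an illustration of the search-engine use case, one could project a query into the same latent topic space and retrieve the closest documents. A hypothetical sketch (the query string is made up for illustration), reusing the 10-topic nmf, W and tfidf_vectorizer objects fitted above:

from sklearn.metrics.pairwise import cosine_similarity

query = "medicare fraud scheme"  # hypothetical query
query_topics = nmf.transform(tfidf_vectorizer.transform([query]))

# rank the documents by similarity to the query in topic space
similarities = cosine_similarity(query_topics, W).ravel()
print(similarities.argsort()[::-1][:5])  # indices of the 5 most similar press releases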

In the next article, we will keep working with topics and semantics, so stay tuned for Part 5!
