Natural Language Processing for Judicial Sentences with Python


In this article, I’m going to perform Latent Semantic (or Topic) Analysis, a technique that analyzes the relationships between documents and their terms by producing new “concepts” (also called semantics or topics) related to both. This technique is founded on the distributional hypothesis, according to which “linguistic terms with similar distributions have similar meanings”. As a consequence, terms that are close in meaning are likely to occur in similar pieces of text.

Latent Semantic Analysis consists of two main steps:

1. Creating a Word-Document Matrix (you can read more about document vectorization here)
2. Reducing the dimensionality of that matrix in order to produce new variables (that will be our semantics or topics).
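The two steps above can be sketched end-to-end on a toy corpus (the documents and variable names below are illustrative stand-ins, not the article’s data):

```python
# Minimal sketch of the two LSA steps on a toy corpus (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

toy_docs = [
    "tax fraud irs returns",
    "irs tax returns fraud",
    "medicare health care hhs",
    "health care medicare fraud",
]

# Step 1: word-document matrix, tf-idf weighted
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_docs)      # shape: (n_docs, n_terms)

# Step 2: reduce dimensionality to k latent topics
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)           # shape: (n_docs, 2)

print(X.shape, doc_topics.shape)
```

Each row of `doc_topics` now places a document in a 2-dimensional latent topic space instead of the original term space.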

I will apply two matrix decomposition techniques: Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NMF). Both rely on the assumption that, in a high-dimensional feature space, a smaller number of dimensions might actually be enough to explain the variation in the data (you can read more about dimensionality reduction here: https://towardsdatascience.com/pca-eigenvectors-and-eigenvalues-1f968bc6777a).

The main difference between the two approaches is that SVD decomposes the vectorized document matrix X into three lower-dimensional matrices, while NMF does so with only two.
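This difference can be made concrete with a small sketch on stand-in data (the random matrix and shapes below are illustrative, not the article’s corpus):

```python
# Illustrative sketch: SVD factors X into THREE matrices, NMF into TWO.
import numpy as np
from sklearn.decomposition import TruncatedSVD, NMF

rng = np.random.default_rng(0)
X = rng.random((20, 50))     # stand-in for a (documents x terms) tf-idf matrix

k = 3
svd = TruncatedSVD(n_components=k, random_state=0)
# fit_transform returns U * S; dividing by the singular values recovers U alone
U = svd.fit_transform(X) / svd.singular_values_   # (20, k) left singular vectors
S = svd.singular_values_                          # k singular values
V = svd.components_                               # (k, 50) right singular vectors
X_svd = U @ np.diag(S) @ V                        # product of three matrices

nmf = NMF(n_components=k, init='nndsvd', random_state=0, max_iter=500)
W = nmf.fit_transform(X)     # (20, k) non-negative document-topic weights
H = nmf.components_          # (k, 50) non-negative topic-term weights
X_nmf = W @ H                # product of two matrices
```

In both cases the product approximates the original X, but NMF additionally constrains all factors to be non-negative, which often makes the topics easier to interpret.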

Let’s implement them in Python.

2-Dimensional Analysis

```python
# create the original matrix X (term-document matrix), vectorized with tf-idf weights
documents = df_factor.Tokens.apply(str).tolist()
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', analyzer='word',
                                   min_df=0.001, max_df=0.5, sublinear_tf=True, use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

# SVD
from sklearn.decomposition import TruncatedSVD

# set number of latent components
k = 10
svd = TruncatedSVD(n_components=k)
%time U = svd.fit_transform(X)
S = svd.singular_values_
V = svd.components_

# NMF
from sklearn.decomposition import NMF

nmf = NMF(n_components=k, init='nndsvd', random_state=0)
%time W = nmf.fit_transform(X)
H = nmf.components_

print("SVD matrices shapes: ", U.shape, S.shape, V.shape)
print("NMF matrices shapes: ", W.shape, H.shape)
```

```
Wall time: 2.5 s
Wall time: 11.4 s
SVD matrices shapes:  (13087, 10) (10,) (10, 40808)
NMF matrices shapes:  (13087, 10) (10, 40808)
```

```python
import numpy as np

def show_topics(A, vocabulary, topn=5):
    """
    Find the top N words for each of the latent dimensions (= rows) in A.
    """
    topic_words = ([[vocabulary[i] for i in np.argsort(t)[:-topn-1:-1]]
                    for t in A])
    return [', '.join(t) for t in topic_words]
```

Now, let’s print the top terms from the vectorized document matrix, weighted by TF-IDF scores (you can read more about this score in the previous part here).

```python
# SVD
terms = tfidf_vectorizer.get_feature_names()
sorted(show_topics(V, terms))
```

```
['antitrust, antitrust division, bid, rigging, bid rigging',
 'child, criminal division, safe childhood, project safe, childhood',
 'child, safe childhood, project safe, childhood, exploitation',
 'epa, environmental, clean, environment, natural',
 'injunction, customers, complaint, preparing, preparers',
 'medicare, hhs, health, health care, care',
 'osc, ina, immigration, citizenship, discrimination provision',
 'rights, civil rights, rights division, discrimination, employment',
 'tax, fraud, false, prison, irs',
 'tax, returns, irs, tax returns, tax division']
```

```python
# NMF
sorted(show_topics(H, terms))
```

```
['antitrust, antitrust division, bid, rigging, bid rigging',
 'child, safe childhood, project safe, childhood, exploitation',
 'epa, environmental, clean, environment, natural',
 'false claims, claims act, claims, civil division, health',
 'fbi, indictment, police, security, law',
 'medicare, hhs, health, health care, care',
 'osc, ina, employment, citizenship, anti discrimination',
 'rights, civil rights, rights division, civil, discrimination',
 'tax, irs, tax division, returns, tax returns',
 'tax, returns, customers, injunction, tax returns']
```

Now let’s plot our lower-dimensional term matrices.

```python
# Initializing a plotting function
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import colors
import seaborn as sns

sns.set_context('notebook')

def plot_vectors(vectors, V, title='VIZ', labels=None, dimensions=3):
    """
    Plot the vectors in 2 or 3 dimensions.
    If labels are supplied, use them to color the data accordingly.
    """
    # set up graph
    fig = plt.figure(figsize=(10, 10))

    # create data frame
    df = pd.DataFrame(data={'x': vectors[:, 0], 'y': vectors[:, 1]})

    # add labels, if supplied
    if labels is not None:
        df['label'] = labels
    else:
        df['label'] = [''] * len(df)

    # assign colors to labels
    cm = plt.get_cmap('tab20b')  # choose the color palette
    n_labels = len(df.label.unique())
    label_colors = [cm(1. * i / n_labels) for i in range(n_labels)]
    cMap = colors.ListedColormap(label_colors)

    # plot in 3 dimensions
    if dimensions == 3:
        # add z-axis information
        df['z'] = vectors[:, 2]
        # define plot
        ax = fig.add_subplot(111, projection='3d')
        frame1 = plt.gca()
        # remove axis ticks
        frame1.axes.xaxis.set_ticklabels([])
        frame1.axes.yaxis.set_ticklabels([])
        frame1.axes.zaxis.set_ticklabels([])
        # plot each label as scatter plot in its own color
        for l, label in enumerate(df.label.unique()):
            df2 = df[df.label == label]
            color_values = [label_colors[l]] * len(df2)
            ax.scatter(df2['x'], df2['y'], df2['z'],
                       c=color_values,
                       cmap=cMap,
                       edgecolor=None,
                       label=label,
                       alpha=0.4,
                       s=100)
        topics = sorted(show_topics(V.components_, tfidf_vectorizer.get_feature_names()))
        print(topics)
        frame1.axes.set_xlabel(topics[0])
        frame1.axes.set_ylabel(topics[1])
        frame1.axes.set_zlabel(topics[2])
    # plot in 2 dimensions
    elif dimensions == 2:
        ax = fig.add_subplot(111)
        frame1 = plt.gca()
        frame1.axes.xaxis.set_ticklabels([])
        frame1.axes.yaxis.set_ticklabels([])
        for l, label in enumerate(df.label.unique()):
            df2 = df[df.label == label]
            color_values = [label_colors[l]] * len(df2)
            ax.scatter(df2['x'], df2['y'],
                       c=color_values,
                       cmap=cMap,
                       edgecolor=None,
                       label=label,
                       alpha=0.4,
                       s=100)
        topics = sorted(show_topics(V.components_, tfidf_vectorizer.get_feature_names()))
        print(topics)
        frame1.axes.set_xlabel(topics[0])
        frame1.axes.set_ylabel(topics[1])
    else:
        raise NotImplementedError()

    ax.legend(frameon=True, ncol=2, fancybox=True, title_fontsize=15,
              loc='center left', bbox_to_anchor=(1, 0.5), labelspacing=2.5, borderpad=2)
    plt.title(title)
    plt.show()
```
```python
# now let's perform the same computations with 2 and 3 dimensions, so that we
# can visualize them. Let's start with 2 dims.
low_dim_svd = TruncatedSVD(n_components=2)
low_dim_U = low_dim_svd.fit_transform(X)
sorted(show_topics(low_dim_svd.components_, tfidf_vectorizer.get_feature_names()))
```

```
['medicare, hhs, health, health care, care',
 'tax, fraud, false, prison, irs']
```

```python
low_dim_nmf = NMF(n_components=2, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
sorted(show_topics(low_dim_nmf.components_, tfidf_vectorizer.get_feature_names()))
```

```
['medicare, health, health care, care, hhs',
 'tax, irs, returns, prison, tax division']
```

It seems that the results are consistent between the two methods. Let’s have a look at the plots:

```python
plot_vectors(low_dim_U, low_dim_svd, title='SVD 2d', dimensions=2)
plot_vectors(low_dim_W, low_dim_nmf, title='NMF 2d', dimensions=2)
```

Now I want to perform the same analysis while also taking into account the labels (or categories) of my articles.

```python
# creating a df with records with only one label
data_single = df_factor.copy()[df_factor[categories].sum(axis=1) == 1]
documents = data_single.text.apply(str).tolist()
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', analyzer='word',
                                   min_df=0.001, max_df=0.5, sublinear_tf=True, use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)

from sklearn.decomposition import TruncatedSVD  # this also works with sparse matrices

# each component is a list; I want a list of elements, not a list of lists
labels = [i[0] for i in data_single.category]

# set number of latent components
k = 10
svd = TruncatedSVD(n_components=k)
%time U = svd.fit_transform(X)
S = svd.singular_values_
V = svd.components_

from sklearn.decomposition import NMF

nmf = NMF(n_components=k, init='nndsvd', random_state=0)
%time W = nmf.fit_transform(X)
H = nmf.components_
```

For this analysis, I will rely only on the NMF method. Indeed, I noticed that the two topics extracted with SVD are very similar to each other:

```python
low_dim_svd = TruncatedSVD(n_components=2)
low_dim_U = low_dim_svd.fit_transform(X)
sorted(show_topics(low_dim_svd.components_, tfidf_vectorizer.get_feature_names()))
```

```
['tax, irs, tax division, fraud, returns',
 'tax, returns, tax division, irs, tax returns']
```

In order for the latent topics to cover as much information as possible, I don’t want two or more components to carry the same piece of information and thus be redundant.
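One possible way to quantify this redundancy (my own sketch, not the article’s code) is the cosine similarity between component vectors: off-diagonal values close to 1 flag near-duplicate topics. The `H` matrix below is a hand-made illustration of a components matrix, not real output:

```python
# Redundancy check sketch: cosine similarity between topic-term vectors.
# H stands in for the components_ matrix of a fitted SVD or NMF model.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

H = np.array([
    [0.9, 0.8, 0.0, 0.1],   # topic 0: dominated by the same terms...
    [0.8, 0.9, 0.1, 0.0],   # topic 1: ...as topic 0 -> redundant
    [0.0, 0.1, 0.9, 0.8],   # topic 2: a genuinely different topic
])

sim = cosine_similarity(H)
print(np.round(sim, 2))     # sim[0, 1] is close to 1, flagging redundancy
```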

```python
low_dim_nmf = NMF(n_components=2, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
plot_vectors(low_dim_W, low_dim_nmf, labels=labels, title='NMF 2d', dimensions=2)
```

3-Dimensional Analysis

Now let’s do the same with 3 dimensions. In this case too, I will rely only on the NMF method: looking at the second and third latent topics (or components) of the SVD method, I noticed that they are very similar:

```
['medicare, hhs, health, health care, care',
 'tax, fraud, false, prison, irs',
 'tax, returns, irs, tax returns, tax division']
```

Hence, for the same reasons explained above, I will not proceed further with the SVD method.

So let’s proceed with creating the 3D matrix and plotting the results:

```python
low_dim_nmf = NMF(n_components=3, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
plot_vectors(low_dim_W, low_dim_nmf, title='NMF 3d', dimensions=3)
```

With the 3D analysis as well, I want to enrich the plots by taking into account the labels (or categories) of my articles.

```python
low_dim_nmf = NMF(n_components=3, init='nndsvd')
low_dim_W = low_dim_nmf.fit_transform(X)
sorted(show_topics(low_dim_nmf.components_, tfidf_vectorizer.get_feature_names()))
plot_vectors(low_dim_W, low_dim_nmf, labels=labels, title='NMF 3d', dimensions=3)
```

Conclusions

From the analyses above, it seems that our methods were able to cluster words belonging to documents with the same label into the same latent topic. This could be helpful for search engines or for the automated categorization of judicial sentences and, more generally, could facilitate navigation across the judicial knowledge base.
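As a rough sketch of how such a pipeline might back a search engine, a query can be projected into the same latent topic space and documents ranked by cosine similarity. The corpus below is a small illustrative stand-in; the variable names mirror the article’s but this is not the article’s code:

```python
# Sketch: topic-space document retrieval with a fitted tf-idf + NMF pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "defendant sentenced for tax fraud and false irs returns",
    "medicare health care fraud scheme charged by hhs",
    "antitrust division investigates bid rigging conspiracy",
]

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, init='nndsvd', random_state=0)
W = nmf.fit_transform(X)                 # documents in latent topic space

# project a query into the same topic space and rank documents by similarity
query = "irs tax returns"
q = nmf.transform(tfidf_vectorizer.transform([query]))
ranking = cosine_similarity(q, W)[0].argsort()[::-1]
print(ranking)
```

The same idea extends to categorization: a new judicial sentence can be assigned the label of its nearest neighbors in topic space.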

In the next article, I will continue working with topics and semantics, so stay tuned for Part 5!
