Build a Plagiarism Checker Using Machine Learning


Demo App Code Walkthrough

We’ve gone through the inner workings of the app, but how did we actually build it? As noted earlier, this is a Python Flask app that uses the Pinecone SDK. The HTML is served from a template file, and the rest of the frontend is built from static CSS and JS assets. To keep things simple, all of the backend code lives in a single app.py file, which we walk through piece by piece below.

Let’s go over the important parts of the app.py file so that we understand how it works.

On lines 1–14, we import our app’s dependencies. Our app relies on the following (a sketch of the import block appears after this list):

  • dotenv for reading environment variables from the .env file
  • flask for the web application setup
  • json for working with JSON
  • os also for getting environment variables
  • pandas for working with the dataset
  • pinecone for working with the Pinecone SDK
  • re for working with regular expressions (RegEx)
  • requests for making API requests to download our dataset
  • statistics for some handy stats methods
  • sentence_transformers for our embedding model
  • swifter for speeding up apply calls on the pandas DataFrame
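
Here’s roughly what that import block looks like. The module names come straight from the list above; the aliases and the specific members pulled from flask are our assumptions:

```python
from dotenv import load_dotenv   # read variables from the .env file
from flask import Flask, render_template, request  # web app setup (members assumed)
import json                      # work with JSON payloads
import os                        # read environment variables
import pandas as pd              # work with the dataset
import pinecone                  # Pinecone SDK
import re                        # regular expressions (RegEx)
import requests                  # API requests to download our dataset
import statistics                # handy stats methods
from sentence_transformers import SentenceTransformer  # embedding model
import swifter                   # fast .apply() on pandas DataFrames
```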

On line 16, we provide some boilerplate code to tell Flask the name of our app.
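
That boilerplate is the standard Flask one-liner:

```python
app = Flask(__name__)  # tell Flask the name of our app
```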

On lines 18–20, we define some constants that will be used in the app. These include the name of our Pinecone index, the file name of the dataset, and the number of rows to read from the CSV file.
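f
Only the index name is spelled out in the article, so the file name and row count below are placeholders:

```python
PINECONE_INDEX_NAME = "plagiarism-checker"  # name of our Pinecone index
DATA_FILE = "articles.csv"                  # dataset file name (placeholder)
NROWS = 20000                               # rows to read from the CSV (placeholder)
```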

On lines 22–25, our initialize_pinecone method gets our API key from the .env file and uses it to initialize Pinecone.
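
A minimal sketch, assuming the 2021-era pinecone-client interface and an environment variable named PINECONE_API_KEY (newer SDK versions use a Pinecone class instead):

```python
def initialize_pinecone():
    load_dotenv()  # pull .env values into the environment
    api_key = os.environ["PINECONE_API_KEY"]  # variable name assumed
    pinecone.init(api_key=api_key)
```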

On lines 27–29, our delete_existing_pinecone_index method searches our Pinecone instance for indexes with the same name as the one we’re using (“plagiarism-checker”). If an existing index is found, we delete it.
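
In sketch form, again assuming the older client API:

```python
def delete_existing_pinecone_index():
    # if an index with our chosen name already exists, delete it so we start fresh
    if PINECONE_INDEX_NAME in pinecone.list_indexes():
        pinecone.delete_index(PINECONE_INDEX_NAME)
```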

On lines 31–35, our create_pinecone_index method creates a new index using the name we chose (“plagiarism-checker”), the “cosine” proximity metric, and only one shard.
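
Something like this, where the shards argument belongs to the older client (newer SDKs size indexes differently):

```python
def create_pinecone_index():
    # cosine similarity suits text embeddings; one shard keeps things simple
    pinecone.create_index(name=PINECONE_INDEX_NAME, metric="cosine", shards=1)
    return pinecone.Index(PINECONE_INDEX_NAME)  # handle for upserts and queries
```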

On lines 37–40, our create_model method uses the sentence_transformers library to work with the Average Word Embeddings Model. We’ll encode our vector embeddings using this model later.
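
The article doesn’t name the exact checkpoint, and sentence_transformers ships several average-word-embeddings models, so the model name below is an assumption:

```python
def create_model():
    # an average word embeddings model (exact checkpoint name assumed)
    return SentenceTransformer("average_word_embeddings_komninos")
```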

On lines 62–68, our process_file method reads the CSV file and then calls the prepare_data and upload_items methods on it. Those two methods are described next.
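
A sketch of that flow, assuming the method takes the file name and returns the prepared DataFrame:

```python
def process_file(filename):
    data = pd.read_csv(filename, nrows=NROWS)  # read only the first NROWS rows
    data = prepare_data(data)                  # clean and combine columns (below)
    upload_items(data)                         # embed and upsert into Pinecone (below)
    return data
```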

On lines 42–56, our prepare_data method adjusts the dataset by renaming the first column to “id” and dropping the “date” column. It then combines the article title with the article content into a single field. We’ll use this combined field when creating the vector embeddings.
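
Roughly like this, with the “title” and “content” column names assumed from the dataset:

```python
def prepare_data(data):
    # rename the first column to "id" and drop the unused "date" column
    data.rename(columns={data.columns[0]: "id"}, inplace=True)
    data.drop(columns=["date"], inplace=True)

    # combine title and content into the single field we'll embed
    data["combined"] = data["title"] + " " + data["content"]
    return data
```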

On lines 58–60, our upload_items method creates a vector embedding for each article by encoding it using our model. Then, we insert the vector embeddings into the Pinecone index.
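
A sketch using swifter to parallelize the encoding step; model and pinecone_index are the module-level globals created at startup, and the (id, vector) tuple form of upsert matches the older client:

```python
def upload_items(data):
    # encode each combined title+content field into a vector embedding
    data["vector"] = data["combined"].swifter.apply(
        lambda text: model.encode(str(text))
    )

    # insert (id, vector) pairs into the Pinecone index
    items = [(row.id, row.vector) for _, row in data.iterrows()]
    pinecone_index.upsert(items=items)
```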

On lines 70–74, our map_titles and map_publications methods create some dictionaries of the titles and publication names to make it easier to find articles by their IDs later.
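
Something like this, with the “publication” column name assumed:

```python
def map_titles(data):
    # id -> title lookup for displaying results
    return dict(zip(data.id, data.title))

def map_publications(data):
    # id -> publication lookup for displaying results
    return dict(zip(data.id, data.publication))
```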

Each of the methods we’ve described so far is called on lines 95–101 when the backend app is started. This work prepares us for the final step of actually querying the Pinecone index based on user input.
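
Put together, the startup sequence plausibly reads:

```python
initialize_pinecone()
delete_existing_pinecone_index()
pinecone_index = create_pinecone_index()
model = create_model()
data = process_file(DATA_FILE)
titles_mapped = map_titles(data)
publications_mapped = map_publications(data)
```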

On lines 103–113, we define two routes for our app: one for the home page and one for the API endpoint. The home page serves up the index.html template file along with the JS and CSS assets, and the API endpoint provides the search functionality for querying the Pinecone index.
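
In sketch form; the form field name “originalContent” is our assumption:

```python
@app.route("/")
def index():
    # Flask serves index.html plus the JS/CSS from the static folder
    return render_template("index.html")

@app.route("/api/search", methods=["POST"])
def search():
    # the frontend posts the user's article text here (field name assumed)
    return json.dumps(query_pinecone(request.form["originalContent"]))
```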

Finally, on lines 76–93, our query_pinecone method takes the user’s article content input, converts it into a vector embedding, and then queries the Pinecone index to find similar articles. This method is called when the /api/search endpoint is hit, which occurs any time the user submits a new search query.
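
A sketch against the older client’s query interface, which returned per-query results carrying ids and scores:

```python
def query_pinecone(content):
    # embed the user's text with the same model used for the dataset
    query_vector = model.encode(str(content))

    # fetch the 10 nearest article vectors (old-client query style)
    results = pinecone_index.query(queries=[query_vector], top_k=10)
    res = results[0]

    # map ids back to titles and publications for the frontend
    return [
        {
            "id": _id,
            "title": titles_mapped[int(_id)],
            "publication": publications_mapped[int(_id)],
            "score": float(res.scores[i]),  # cast numpy float for JSON
        }
        for i, _id in enumerate(res.ids)
    ]
```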

For the visual learners out there, here’s a diagram outlining how the app works:

[Diagram: App architecture and user experience]
