Demo App Code Walkthrough
We’ve gone through the inner workings of the app, but how did we actually build it? As noted earlier, this is a Python Flask app that utilizes the Pinecone SDK. The HTML uses a template file, and the rest of the frontend is built using static CSS and JS assets. To keep things simple, all of the backend code is found in the
app.py file, which we’ve reproduced in full below:
Let’s go over the important parts of the
app.py file so that we understand it.
On lines 1–14, we import our app’s dependencies. Our app relies on the following:
dotenv for reading environment variables from the .env file
flask for the web application setup
json for working with JSON
os also for getting environment variables
pandas for working with the dataset
pinecone for working with the Pinecone SDK
re for working with regular expressions (RegEx)
requests for making API requests to download our dataset
statistics for some handy stats methods
sentence_transformers for our embedding model
swifter for working with the pandas DataFrame
On line 16, we provide some boilerplate code to tell Flask the name of our app.
On lines 18–20, we define some constants that will be used in the app. These include the name of our Pinecone index, the file name of the dataset, and the number of rows to read from the CSV file.
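Those three constants might look like this; the index name comes straight from the walkthrough, while the file name and row count shown here are assumed placeholders:

```python
# The index name is stated in the walkthrough; DATA_FILE and NROWS
# are assumed placeholder values for the dataset file and row cap.
PINECONE_INDEX_NAME = "plagiarism-checker"
DATA_FILE = "articles.csv"  # assumed file name of the dataset CSV
NROWS = 20000               # assumed number of rows to read
```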
On lines 22–25, our
initialize_pinecone method gets our API key from the
.env file and uses it to initialize Pinecone.
On lines 27–29, our
delete_existing_pinecone_index method searches our Pinecone instance for indexes with the same name as the one we’re using (“plagiarism-checker”). If an existing index is found, we delete it.
On lines 31–35, our
create_pinecone_index method creates a new index using the name we chose (“plagiarism-checker”), the “cosine” similarity metric, and only one shard.
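As a sketch, these three setup helpers could look like the following, assuming the legacy pinecone-client API (pinecone.init, list_indexes, delete_index, create_index) from the era of this demo; newer SDK versions use different entry points and signatures:

```python
import os

PINECONE_INDEX_NAME = "plagiarism-checker"

def initialize_pinecone():
    # Load PINECONE_API_KEY from .env and initialize the client.
    # Imports are deferred so the sketch stands alone.
    from dotenv import load_dotenv
    import pinecone
    load_dotenv()
    pinecone.init(api_key=os.environ["PINECONE_API_KEY"])

def delete_existing_pinecone_index():
    # Drop any leftover index with the same name before recreating it.
    import pinecone
    if PINECONE_INDEX_NAME in pinecone.list_indexes():
        pinecone.delete_index(PINECONE_INDEX_NAME)

def create_pinecone_index():
    # One shard, cosine similarity, as described above.
    import pinecone
    pinecone.create_index(PINECONE_INDEX_NAME, metric="cosine", shards=1)
```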
On lines 37–40, our
create_model method uses the
sentence_transformers library to work with the Average Word Embeddings Model. We’ll encode our vector embeddings using this model later.
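This step is essentially a one-liner; the exact checkpoint name below is an assumption (sentence-transformers ships several average-word-embeddings models):

```python
def create_model():
    # Deferred import keeps the sketch self-contained; the checkpoint
    # name "average_word_embeddings_komninos" is an assumption.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("average_word_embeddings_komninos")
```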
On lines 62–68, our
process_file method reads the CSV file and then calls the prepare_data and upload_items methods on it. Those two methods are described next.
On lines 42–56, our
prepare_data method adjusts the dataset by renaming the first “id” column and dropping the “date” column. It then combines the article title with the article content into a single field. We’ll use this combined field when creating the vector embeddings.
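In plain pandas this step could look like the following (the demo uses swifter to parallelize apply; plain apply gives the same result, just slower, and the column names here are assumptions):

```python
import pandas as pd

def prepare_data(data: pd.DataFrame) -> pd.DataFrame:
    # Rename the unnamed first column to "id", drop "date", and
    # combine title + content into the single field we will embed.
    data = data.rename(columns={data.columns[0]: "id"})
    data = data.drop(columns=["date"])
    data["combined"] = data.apply(
        lambda row: f"{row['title']} {row['content']}", axis=1
    )
    return data
```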
On lines 58–60, our
upload_items method creates a vector embedding for each article by encoding it using our model. Then, we insert the vector embeddings into the Pinecone index.
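A sketch of the upload step; the model and index handle are passed in as parameters so the example stands alone, and the items= keyword reflects the legacy client (newer SDKs use vectors=):

```python
def upload_items(data, model, index):
    # Encode each article's combined text and upsert (id, vector)
    # pairs into the Pinecone index.
    items_to_upload = [
        (str(row.id), model.encode(row.combined).tolist())
        for row in data.itertuples()
    ]
    index.upsert(items=items_to_upload)
```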
On lines 70–74, our
map_publications method creates some dictionaries of the titles and publication names to make it easier to find articles by their IDs later.
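A sketch of those lookup tables, assuming id, title, and publication columns in the prepared dataframe:

```python
import pandas as pd

def map_titles(data: pd.DataFrame) -> dict:
    # id -> title, for labeling search results later.
    return dict(zip(data.id, data.title))

def map_publications(data: pd.DataFrame) -> dict:
    # id -> publication name.
    return dict(zip(data.id, data.publication))
```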
Each of the methods we’ve described so far is called on lines 95–101 when the backend app is started. This work prepares us for the final step of actually querying the Pinecone index based on user input.
On lines 103–113, we define two routes for our app: one for the home page and one for the API endpoint. The home page serves up the
index.html template file along with the JS and CSS assets, and the API endpoint provides the search functionality for querying the Pinecone index.
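The two routes could be wired up roughly like this; the form-field name originalContent and the query helper's call shape are assumptions:

```python
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Serves templates/index.html; Flask serves static JS/CSS itself.
    return render_template("index.html")

@app.route("/api/search", methods=["POST"])
def search():
    # Hand the submitted text to the query helper (defined elsewhere)
    # and return the matches as JSON.
    query_text = request.form.get("originalContent", "")
    return jsonify(query_pinecone(query_text))
```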
Finally, on lines 76–93, our
query_pinecone method takes the user’s article content input, converts it into a vector embedding, and then queries the Pinecone index to find similar articles. This method is called when the
/api/search endpoint is hit, which occurs any time the user submits a new search query.
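A simplified sketch of that query path (the real method also does per-sentence scoring, which is where re and statistics come in; here the model, index, and lookup dicts are passed in, and the response shape assumes the legacy client):

```python
def query_pinecone(content, model, index, titles, publications, top_k=10):
    # Embed the submitted text and ask Pinecone for the nearest
    # articles, then attach human-readable metadata to each match.
    vector = model.encode(content).tolist()
    response = index.query(queries=[vector], top_k=top_k)
    matches = response["results"][0]["matches"]
    return [
        {
            "id": m["id"],
            "score": m["score"],
            "title": titles.get(m["id"]),
            "publication": publications.get(m["id"]),
        }
        for m in matches
    ]
```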
For the visual learners out there, here’s a diagram outlining how the app works: