Web-Based Database-Powered Chatbot Project — Module 1: Approximate String Matching



As you saw in the summary above, I’ve always been interested in chatbots. I have now built (and actually keep extending every day) a complete, fully web-based chatbot that pops up on different pages of my website, guiding visitors and answering their questions. I wrote this chatbot as an HTML/JS component, so I can easily incorporate it into any page with just a script tag. By using the browser’s localStorage feature, my chatbot can keep a conversation flowing when the user opens multiple pages of my site, even in different tabs or windows. By using CSS, the chatbot easily adjusts to smartphone or computer screens. And I let the bot write the conversations to a file on my site, which I can then inspect to learn what visitors usually ask and how their questions deviate from the core knowledge database, so that I can improve the chatbot accordingly.

But I will talk about the design and features in some future article. Here I want to describe the first track of its main module: the one that allows the bot to respond directly to my visitors’ questions, and also to engage in some basic chit-chat, using answers retrieved from a database through approximate string matching. In an upcoming article I will describe the chatbot’s second track, which extends its capabilities enormously by combining a knowledge base with text generation by GPT-3.

Answering questions from a knowledge base with tolerance to typos and variable wording, thanks to string comparison algorithms

There are many ways to classify chatbots, but one very important distinction is whether they only provide answers taken literally from a database or can actually make up text that stands as a reasonable reply to the question. The latter is much more challenging from the programming point of view, as it requires some form of high-end natural language processing model to transform questions into reasonable answers. The main advantage of this AI-based approach is that, if done well, it generalizes very well and can provide correct answers to questions asked in many different ways; besides, it is natively tolerant to typos and grammar errors. But such AI programs are not easy to build and, most importantly, you always risk that the bot makes up text that includes incorrect information, inappropriate content, sensitive information, or just unreadable text. High-end natural language processing programs such as GPT-3 can be a good solution, but even these can make up text with errors or inappropriate content.

I will turn to the GPT-3 module of my chatbot soon, whereas here I will elaborate on the module that works through question/answer matching. In case you’re very curious and can’t wait for my article describing the bot’s GPT-3 module, let me share with you here some pieces I wrote about GPT-3 where you’ll find some negative and positive points plus quite a bit of code and instructions for you to run your own tests with this technology:

And be sure that I will come back to GPT-3 soon, as it powers the second track of my website’s chatbot’s brain.

The question-answer module of my website’s chatbot

Here you have a screenshot of my chatbot using its question/answer matching track to chat with a human user who happened to ask for jokes:

The chatbot telling jokes from its knowledge base.

Let’s see how this question/answer matching module works. It is an alternative to complex AI models, and safer because it only replies with what’s coded in the database. Essentially, it consists of “simply” searching for the human’s input inside a database of questions and answers, and then providing the corresponding precompiled answer(s). If the database is sufficiently large and the human is warned that the bot’s knowledge is limited to certain topics only, the overall experience might be just good enough, at least within its intended usage.

There are a few important points to address, though:

  • A “sufficiently large” database of question-answer pairs is not easy to get, at least not one whose content you can be sure about.
  • Humans can make typos and writing errors, so the chatbot should ideally be tolerant to them.
  • Even without any errors, humans can (and most likely will) ask questions in ways that differ from those coded in the dataset. The chatbot should try to account for this too.

The database problem is not easy to solve. As I detail below, for my chatbot I took an open-source (MIT license) database of question-answer pairs from Microsoft’s GitHub account, and extended it with content specific to me and the projects I work on, because that’s what the chatbot is supposed to answer about. Meanwhile, the difficulty introduced by typos, errors, and input variability can be tackled by searching the database not for exact questions but for entries that resemble the question asked. This requires string comparison metrics rather than perfect matching, and quite some cleanup of the human’s input before doing the search.

Let’s see all these points one by one.

1. Cleaning the human input

I coded into my chatbot a function that can clean up different aspects of human input. Depending on the arguments, the function will expand contractions, remove symbols, remove numbers in multiple forms, or remove stopwords. Notice that this means the database should preferably not contain any symbols, numerical information, or contractions, as they will lower the match scores upon search. For example, all instances of “website’s” in the database are expanded to “website is”.

The operations involved in cleaning seem trivial, but they are again limited by the availability of resources such as stopword lists. I compiled quite a long list from several sources, which you can now just borrow. But please acknowledge me, just as my code acknowledges my sources!

Here’s the full function, including the lists of stopwords, symbols, etc.: http://lucianoabriata.altervista.org/chatbotallweb/cleaningfunction.txt

Notice that when calling the function one can choose what to clean. Some parts of my code request full cleanup, while others request cleanup of symbols and numbers but not stopwords. Besides, my code also cleans up other potential sources of problems right before creating the search query for the knowledge database. For example, it replaces occurrences of “he”, “him” and “his” with “luciano”, because I assume that anybody asking my website’s bot about a third person refers to me, and it is coded like this in the database. Of course, this will not work properly if the visitor is actually asking about another person… Anyway, the database has “Luciano” everywhere in its answers, so it will be clear that the answers refer to me even if the human might be thinking about somebody else. Likewise, part of the cleaning procedure is converting all inputs to lowercase and keeping all questions of the database in lowercase too (while all answers are properly capitalized). Besides, all inputs and questions are fully trimmed of any terminal spaces.
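
To make the cleaning step concrete, here is a minimal sketch of how such a function could look. It is not the full function linked above: the word lists are tiny hypothetical stand-ins, the names cleanInput and replaceThirdPerson are made up for illustration, only numbers written as digits are handled, and the symbol filter assumes plain English text.

```javascript
// Hypothetical, tiny word lists for illustration; real lists are much longer.
const CONTRACTIONS = {
  "what's": "what is",
  "i'm": "i am",
  "don't": "do not",
  "can't": "cannot",
};
const STOPWORDS = new Set(["a", "an", "the", "is", "are", "do", "of", "to"]);

// Clean a piece of human input; each option switches one cleanup step on or off.
function cleanInput(text, options = {}) {
  const {
    expandContractions = true,
    removeSymbols = true,
    removeNumbers = true,
    removeStopwords = false,
  } = options;

  // Lowercase and normalize curly apostrophes so contractions can be found.
  let out = text.toLowerCase().replace(/[\u2018\u2019]/g, "'");

  if (expandContractions) {
    for (const [contraction, expansion] of Object.entries(CONTRACTIONS)) {
      out = out.split(contraction).join(expansion);
    }
  }
  // Assumption: English-only input, so anything outside a-z, 0-9 is a symbol.
  if (removeSymbols) out = out.replace(/[^a-z0-9\s]/g, " ");
  // Assumption: only numbers written as digits are removed here.
  if (removeNumbers) out = out.replace(/[0-9]+/g, " ");

  // Splitting on whitespace and rejoining also trims and collapses spaces.
  let words = out.split(/\s+/).filter(Boolean);
  if (removeStopwords) words = words.filter((w) => !STOPWORDS.has(w));
  return words.join(" ");
}

// Map third-person pronouns to "luciano" right before building the query.
function replaceThirdPerson(cleanedText) {
  return cleanedText.replace(/\b(he|him|his)\b/g, "luciano");
}
```

For example, cleanInput("  What's your name?? ") gives "what is your name", and passing { removeStopwords: true } additionally drops the listed stopwords.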

2. Database of question-answer pairs

For my chatbot I took the English version of Microsoft’s personality chit-chat database for chatbots, and started adding content specific to me and the projects I work on. Why? Well, because the whole point of my bot is to guide the visitors of my websites and answer their questions about me and my projects, and of course Microsoft doesn’t know anything about me and my projects! In fact, when the visitor lands on my website the chatbot already explains that its knowledge is limited to certain topics (which I introduced manually into the database) and basic chit-chat (from Microsoft’s database plus some custom additions and edits).

Here’s the database I used from Microsoft. As you can see there are different languages and personality styles supported:

I actually reshaped this file to have each question-answer pair on a single line, which makes it much easier to then add more entries. And for many questions, I give multiple possible answers so that when a visitor repeats a question or asks two very similar questions, the bot doesn’t always repeat itself.

This is an example entry from the knowledge base:

hello||good morning||good evening||good night||whats up||wasup||whasup##***##***##Hello, all good?||Hi there, how are doing?

The line is split by ## delimiters into 4 fields. The first field contains all the possible ways to ask a question (well, here just some ways to say hello, and there are more in another line), separated by ||. The last field is a list of possible answers, again separated by ||; here there are two different options.

The second field contains a kind of “disclaimer” that the chatbot will use, before giving the possible answer, if it finds only a partial match to one of the questions. This produces a more natural conversation. For example, if the user asks “what’s your name?” with some typos, then the bot will answer “Asking for my name?” followed by one of the preset answers. Notice that, the way I coded my bot, this will not trigger if the typo is very small. For example, here I ask first with a single typo (which results in a direct answer) and then with multiple typos (where the answer is preceded by a small disclaiming sentence):

Example of “disclaimer” sentence used when the match to the questions in the database is poor but not null (if there is no match at all, the bot will just say it didn’t get what was asked).

The third field holds keywords representing the main topic of the question-answer pair, useful to help maintain at least some minimal context in a conversation. For example, here the human asks about moleculARweb (a website I developed together with a colleague at work) and then asks another question, referring to it as “it”… and the chatbot gets it:

The chatbot remembering the main topic of the conversation.
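
As an illustration of the four-field line format described above, here is a minimal parsing sketch. The function name parseEntry and the field names are hypothetical, and treating *** as "this field is empty" is an assumption based only on the example entry shown earlier.

```javascript
// Parse one knowledge-base line of the form
//   questions##disclaimer##topic##answers
// where questions and answers are lists separated by "||".
function parseEntry(line) {
  const [questions, disclaimer, topic, answers] = line.split("##");
  // Assumption: "***" marks an empty disclaimer or topic field.
  const orNull = (field) => (field === "***" ? null : field);
  return {
    questions: questions.split("||").map((q) => q.trim()),
    disclaimer: orNull(disclaimer),
    topic: orNull(topic),
    answers: answers.split("||").map((a) => a.trim()),
  };
}

const exampleLine =
  "hello||good morning||good evening||good night||whats up||wasup||whasup" +
  "##***##***##Hello, all good?||Hi there, how are doing?";
const exampleEntry = parseEntry(exampleLine);
```

Here exampleEntry.questions holds the seven greetings and exampleEntry.answers the two replies. A format like this only works as long as no question or answer contains ## or || itself.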

3. Searching the database

Of course the fastest option to search questions is to simply match the input text typed by the human to each possible input listed in each possible line. And the chatbot does this, as the first thing it attempts. If a match is found, the code randomly chooses one of the listed answers and displays it. Try for example asking my chatbot for some jokes:

My website’s chatbot telling some jokes. By using random numbers and saving its last outputs, it doesn’t repeat itself too much.
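
The exact-match step and the random choice of an answer could be sketched as follows. This is an assumption-laden illustration, not the bot's actual code: the names findExactMatch and pickAnswer are hypothetical, entries are assumed to be objects with questions and answers arrays, and "not repeating itself" is reduced here to avoiding only the single last answer given.

```javascript
// Return the first entry that lists the cleaned input verbatim, or null.
function findExactMatch(entries, cleanedInput) {
  return entries.find((entry) => entry.questions.includes(cleanedInput)) ?? null;
}

// Pick a random answer, skipping the last one used when alternatives exist.
// `random` is injectable so the behavior can be checked deterministically.
function pickAnswer(entry, lastAnswer = null, random = Math.random) {
  const candidates =
    entry.answers.length > 1
      ? entry.answers.filter((answer) => answer !== lastAnswer)
      : entry.answers;
  return candidates[Math.floor(random() * candidates.length)];
}

// Hypothetical mini knowledge base.
const demoEntries = [
  { questions: ["hello", "good morning"], answers: ["Hello, all good?", "Hi there!"] },
  { questions: ["tell me a joke"], answers: ["Joke A", "Joke B", "Joke C"] },
];
```

With this, findExactMatch(demoEntries, "tell me a joke") returns the jokes entry, and calling pickAnswer on it with the previous joke as lastAnswer always yields a different one.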

The chatbot also runs this perfect-match search after stripping all stopwords from both the human’s input and all possible inputs in the database. Again, if there’s a perfect match it will display an answer from the list. But before doing this, the chatbot attempts to find the question as typed by the human inside the database, this time allowing for typos, grammar mistakes, and even swapped words. To achieve this it uses two functions that measure string similarity:

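The Jaro-Winkler Distance, which measures how similar two sequences are from the number of characters they share in roughly the same positions and the number of transpositions among them, with extra weight given to a common prefix. It is related to edit distances, which count the minimum number of operations required to transform one string into the other. It ranges from 0 to 1, with 1 meaning a perfect match. See here for a Wikipedia entry, and here for the original paper by Jaro and Winkler.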
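
For reference, here is a compact sketch of the usual textbook formulation of this metric. It is not the code used in the chatbot; the function names jaro and jaroWinkler, the prefix weight of 0.1, and the maximum prefix length of 4 are the commonly quoted defaults, taken here as assumptions.

```javascript
// Jaro similarity: based on matching characters within a sliding range
// and on how many of those matches appear in a different order.
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;

  const range = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const matched1 = new Array(s1.length).fill(false);
  const matched2 = new Array(s2.length).fill(false);
  let matches = 0;

  for (let i = 0; i < s1.length; i++) {
    const start = Math.max(0, i - range);
    const end = Math.min(s2.length - 1, i + range);
    for (let j = start; j <= end; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = true;
        matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Count matched characters that appear in a different order.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;

  return (
    (matches / s1.length + matches / s2.length + (matches - transpositions) / matches) / 3
  );
}

// Jaro-Winkler: boosts the Jaro score for strings sharing a prefix.
function jaroWinkler(s1, s2, prefixWeight = 0.1) {
  const base = jaro(s1, s2);
  const maxPrefix = Math.min(4, s1.length, s2.length);
  let prefix = 0;
  while (prefix < maxPrefix && s1[prefix] === s2[prefix]) prefix++;
  return base + prefix * prefixWeight * (1 - base);
}
```

With this formulation, the classic pair "martha" and "marhta" scores about 0.961, and "super" versus "suter" scores about 0.893.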

The Text Cosine Similarity, which measures how well the number of occurrences of each word matches between the two strings, as the cosine of the angle formed by the n-dimensional vectors made up of all the frequencies of all n words from both strings. It also ranges from 0 to 1 with 1 meaning perfect match. It is one specific application of the more general cosine similarity, about which you can read in this very nice article at TDS by Ben Chamblee.
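
A minimal sketch of this word-frequency cosine similarity is shown below. It assumes the strings were already cleaned (lowercase, no symbols) and splits words on whitespace; the function names wordCounts and textCosineSimilarity are illustrative only.

```javascript
// Count how many times each word occurs in a string.
function wordCounts(text) {
  const counts = new Map();
  for (const word of text.split(/\s+/).filter(Boolean)) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine of the angle between the two word-frequency vectors.
function textCosineSimilarity(a, b) {
  const countsA = wordCounts(a);
  const countsB = wordCounts(b);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [word, n] of countsA) {
    dot += n * (countsB.get(word) || 0);
    normA += n * n;
  }
  for (const n of countsB.values()) normB += n * n;
  if (normA === 0 || normB === 0) return 0; // at least one string has no words
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Word order does not matter here, so "tell me a joke" and "a joke tell me" score 1, while a single misspelled word such as "suter" scores 0 against "super".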

Notice that by construction, the Jaro-Winkler distance will score pairs of strings with typos and different spellings as highly similar. Take for example super and suter, or modeling and modelling. On the other hand, text cosine similarity will score single words with typos very poorly, because each will be counted as a different word in the frequency calculations. Conversely, and unlike the Jaro-Winkler distance, the text cosine similarity metric will score pairs of sentences made up of the same words in different orders as matching perfectly. Thus, these two types of string similarity metrics are clearly very complementary. That’s why I incorporated both in my chatbot, and assume a match when either of the two scores is above a threshold.

The Jaro-Winkler and text cosine similarity metrics are quite complementary, so I gave my chatbot the option to use both. If either score is above the similarity threshold, the matching database question is taken as what the user typed (or might have wanted to type).
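
One way to picture the search is a loop over every question in the database that keeps the best score given by any of the metrics. The sketch below is only an assumption about how this could be organized: findBestMatch is a hypothetical name, and the two metrics defined here are crude stand-ins (shared prefix and shared words) so that the example runs on its own. In practice one would pass Jaro-Winkler and text cosine similarity functions instead.

```javascript
// Scan all questions and keep the one with the highest score from any metric.
// `metrics` is a list of functions (a, b) => number between 0 and 1.
function findBestMatch(entries, input, metrics) {
  let best = { score: 0, question: null, entry: null };
  for (const entry of entries) {
    for (const question of entry.questions) {
      const score = Math.max(...metrics.map((metric) => metric(input, question)));
      if (score > best.score) best = { score, question, entry };
    }
  }
  return best;
}

// Stand-in metric 1: fraction of the longer string covered by a shared prefix.
function sharedPrefixRatio(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i / Math.max(a.length, b.length, 1);
}

// Stand-in metric 2: fraction of distinct words the two strings share.
function sharedWordRatio(a, b) {
  const wordsA = new Set(a.split(" "));
  const wordsB = new Set(b.split(" "));
  let shared = 0;
  for (const word of wordsA) if (wordsB.has(word)) shared++;
  return shared / Math.max(wordsA.size, wordsB.size);
}

// Hypothetical mini knowledge base.
const sampleEntries = [
  { questions: ["tell me a joke", "say something funny"], answers: ["Joke A"] },
  { questions: ["what is your name"], answers: ["I am a bot."] },
];
```

Taking the maximum over the metrics mirrors the "either score above the threshold" rule: a reordered sentence is rescued by the word-based metric, a typo by the character-based one.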

The threshold actually has multiple levels: when the similarity between the user’s text and one of the questions of the knowledge base is above 0.95, it is treated as a perfect match, so the answer from the base is given right away. If the score is between 0.88 and 0.95, the program gives the same answer but preceded by a variable sentence of the kind “Did you mean this?”. If the score is between 0.8 and 0.88, the chatbot clarifies that it’s not sure about the question, followed by a candidate question from the base and its corresponding answer.
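
The tiered decision can be summarized in a few lines. This is a sketch under assumptions: the function name classifyMatch and the tier labels are hypothetical, and how scores falling exactly on 0.95, 0.88 or 0.8 are treated is a guess, since the text only gives the ranges.

```javascript
// Decide how to reply from the two similarity scores of the best candidate.
function classifyMatch(jaroWinklerScore, cosineScore) {
  // A match counts if either metric clears the threshold, so use the higher one.
  const score = Math.max(jaroWinklerScore, cosineScore);
  if (score > 0.95) return { score, tier: "direct" }; // answer right away
  if (score >= 0.88) return { score, tier: "did-you-mean" }; // short preface, then answer
  if (score >= 0.8) return { score, tier: "unsure" }; // show candidate question and answer
  return { score, tier: "none" }; // bot says it did not get the question
}
```

For instance, classifyMatch(0.5, 0.9) falls in the "did-you-mean" tier because the cosine score alone clears 0.88.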

But how exactly to compute the Jaro-Winkler and Text Cosine similarities?

I took these functions from this excellent article by Suman Kunwar, which provides them already in JavaScript:

In fact this article describes (and gives code for) 4 string comparison functions. But I took only Jaro-Winkler and Text Cosine for the reasons given above. And they turned out to work quite well for me, although no, they aren’t infallible.

As one last note, it is important that the inputs and all possible questions are all in lowercase, trimmed, and cleared of all symbols and numbers. But not of stopwords, which can often help to define the overall meaning of a sentence.

