Open Domain Classification
When we designed our chatbot, we realised that there would be some queries that it could not handle. For quite a while, our approach was to apply a threshold to the intent confidence predicted by our classifier. This threshold is a hyperparameter that we set by reviewing real feedback from our customers. However, we still kept getting false positives for some clearly out of domain queries.
This is a bad customer experience. Our goal is for the chatbot to do well on the queries it is supposed to handle, and to back out clearly from the queries it is not supposed to handle.
There are 3 reasons why the chatbot behaves erratically on open domain queries:
- The closed world assumption made while designing the chatbot.
- Overconfidence of neural networks.
- Propensity of ReLU networks to yield high confidence predictions.
In this article, we will talk about these 3 issues and our strategy to implement open domain classification for our chatbot.
Closed World Assumption
Geli Fei and Bing Liu’s paper defines the closed world assumption as follows:
The closed world assumption, which focuses on designing accurate classifiers under the assumption that all test classes are known at training time.
Most text classification systems operate under the closed world assumption. For many applications, this assumption is sufficient. If we want to classify emails as spam or ham, or reviews as positive, neutral or negative, the closed world assumption is good enough.
However, for chatbots, the closed world assumption is not realistic. A more realistic scenario is to expect unseen classes during testing (open world). Thus, the goal of the chatbot is not just to classify the chat text into one of the multiple intents, but also to reject text that is not in any of the known classes.
The typical Softmax based multi-class classification system is not designed to detect out of domain classes. And, as we will see in the next sections, because of the mis-calibration of confidence scores, thresholding on confidence is not the best way to detect out of domain classes.
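To make the limitation concrete, here is a minimal sketch of the closed world baseline: take the top Softmax class and reject it only if its probability falls below a threshold. The intent names, the logits and the 0.7 threshold are made up for illustration. The two inputs have logits that differ only by a constant shift, and Softmax cannot tell them apart, even though independent sigmoids on the second set of logits are all close to zero.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify_with_threshold(logits, intents, threshold):
    """Closed world baseline: top softmax class, rejected only below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return intents[best], probs[best]

# Hypothetical intents and logits, chosen only to illustrate the point.
INTENTS = ["check_balance", "reset_password", "book_appointment"]
in_domain = [2.0, 0.0, 0.0]
out_of_domain = [-8.0, -10.0, -10.0]  # the same logits shifted down by 10

print(classify_with_threshold(in_domain, INTENTS, 0.7))
print(classify_with_threshold(out_of_domain, INTENTS, 0.7))
print([sigmoid(z) for z in out_of_domain])  # every per-class score is tiny
```

Both calls return the same intent with the same confidence of roughly 0.79, so the threshold cannot separate them. Real out of domain inputs will not necessarily produce uniformly low logits; the point is only that Softmax normalises away the information that would reveal it.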
Overconfidence of Neural Networks
Neural network-based classifiers have improved in accuracy, but that improvement has come at the cost of calibration. A neural network is calibrated when its output confidence equals the probability that the prediction is correct. If the Softmax layer outputs a confidence of 0.8 for a class, and we have 100 independent predictions that each score 0.8, we expect 20 of those predictions to be wrong. However, modern deep neural networks (DNNs) are mis-calibrated. That is, if the scores in the example above came from a DNN, more than 20 predictions would be wrong.
This change from older (classical and not-so-deep) networks was empirically measured in a paper by Chuan Guo et al. In the paper, they show that in a shallow network (LeNet) the accuracy of the network is close to the average confidence score. However, in a deeper network (ResNet) the average confidence score is much higher than the accuracy. The authors empirically demonstrate that the deeper network is overconfident.
While the authors do not replicate the experiment for text classification, we can argue that the same phenomenon holds true. Even in text classifiers, the accuracy gains of DNNs come at the cost of mis-calibration.
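The gap between confidence and accuracy can be measured on any set of labelled predictions. The sketch below computes a binned calibration error: predictions are grouped into equal-width confidence bins, and the absolute difference between accuracy and mean confidence in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins and the sample data are our own illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

# The example from the text: 100 predictions, each with confidence 0.8.
scores = [0.8] * 100
calibrated = [True] * 80 + [False] * 20      # 20 wrong, as a calibrated model implies
overconfident = [True] * 60 + [False] * 40   # 40 wrong, the same scores are too high

print(expected_calibration_error(scores, calibrated))
print(expected_calibration_error(scores, overconfident))
```

The first value is approximately 0 and the second approximately 0.2, which is the difference between the stated confidence of 0.8 and the observed accuracy of 0.6.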
To detect out of domain queries, we can use a high threshold on the class confidence. But since the threshold is a hyperparameter, the only way to find its value is to collect real data after the virtual agent is released. Thus, the value can only be obtained from negative user feedback. The second problem is that, in our practical experience, the level of ‘over confidence’ is different for each of our classes. Thus, in the ideal case, we would need an independent hyperparameter for each class that we support in our virtual agent.
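One way to see why a single threshold does not fit every class is to compute the overconfidence gap per intent from reviewed predictions. The sketch below is illustrative only: the record format, the intent names and the numbers are hypothetical, not data from our virtual agent.

```python
from collections import defaultdict

def overconfidence_by_class(records):
    """Map each predicted intent to (mean confidence - accuracy).

    records: iterable of (predicted_intent, confidence, was_correct) tuples.
    A positive gap means the classifier is overconfident for that intent.
    """
    grouped = defaultdict(list)
    for intent, confidence, was_correct in records:
        grouped[intent].append((confidence, was_correct))

    gaps = {}
    for intent, rows in grouped.items():
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        gaps[intent] = mean_conf - accuracy
    return gaps

# Hypothetical reviewed predictions.
reviewed = [
    ("check_balance", 0.9, True),
    ("check_balance", 0.9, False),
    ("reset_password", 0.8, True),
]
print(overconfidence_by_class(reviewed))
```

In this made-up sample the gap is about 0.4 for one intent and about -0.2 for the other, so a threshold tuned for one would be wrong for the other.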
ReLU networks yield high confidence predictions
The final issue is one that is inherent in ReLU based neural networks. This problem is discussed in a mathematically dense paper by Matthias Hein et al. In that paper, the authors first prove that ReLU based neural networks are continuous piecewise affine classifiers. Using this result, they then prove that, as long as some mild conditions hold, a ReLU based network will always produce a false positive somewhere in the input space. The paper further proves that the false positive is not necessarily close to the training examples in the input space. In fact, they prove that, if you look for one, you will always find an example far away from the training examples of a class that still receives a high confidence prediction for that class.
This last point shows that no threshold, however high, is sufficient to remove false positives for out of domain classes. Even if we set the threshold very high, there are always inputs far from the training examples that will receive a high confidence prediction.
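The scaling behaviour behind this result can be observed on a toy network. The sketch below builds a small ReLU network with random, untrained weights (an assumption made purely for illustration; it is not the setup used in the paper) and evaluates the top Softmax probability at points that move further and further from the origin along a fixed direction. Because the network is piecewise affine, the logits eventually grow linearly with the scale, so the gap between them widens and the top probability is pushed towards 1.

```python
import math
import random

def relu_network_logits(x, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output layer."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def max_softmax(logits):
    """Top softmax probability, computed stably by subtracting the max logit."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def random_network(rng, n_in=2, n_hidden=16, n_classes=3):
    """Random Gaussian weights; a stand-in for a trained classifier."""
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_classes)]
    b2 = [rng.gauss(0, 1) for _ in range(n_classes)]
    return w1, b1, w2, b2

def confidence_along_ray(direction, scales, network):
    """Top softmax probability at scale * direction for each scale."""
    return [max_softmax(relu_network_logits([s * d for d in direction], *network))
            for s in scales]

network = random_network(random.Random(0))
scales = [1, 10, 100, 1000, 10000]
for scale, conf in zip(scales, confidence_along_ray([1.0, -0.5], scales, network)):
    print(scale, conf)
```

No fixed confidence threshold below 1 can reject every such far-away point, which is the practical consequence described above.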
Solution to this problem
We have established that, under a closed world assumption, a ReLU based DNN will make false classifications. Because of the inherent overconfidence of ReLU based DNNs, thresholding the confidence score is not a reliable way to detect out of domain classes. A typical approach to multi-class classification is given below.
We used the approach proposed in the paper by Lei Shu et al. to solve the problem of mis-classification in the Virtual Agent. This approach works well for us because we neither need additional training data nor have to change the way we use the classifier. A diagram of the approach is below:
In this method, instead of Softmax, we use multiple sigmoids and thus get multiple probability scores, one from each Sigmoid unit. This configuration allows us to create a simple out of domain detector: if none of the sigmoids is above the threshold, we declare the input text out of domain. However, if multiple sigmoids have outputs above the threshold, we choose the class that has the highest activation.
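The decision rule described above can be sketched in a few lines. This is a minimal illustration rather than our production code: the function name, the intent names, the logits and the default threshold of 0.5 are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def open_world_predict(logits, intents, threshold=0.5):
    """Return (intent, score); intent is None when no sigmoid reaches the threshold."""
    scores = [sigmoid(z) for z in logits]
    # If several classes clear the threshold, the highest activation wins.
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None, scores[best]  # rejected as out of domain
    return intents[best], scores[best]

# Hypothetical intents and logits.
INTENTS = ["check_balance", "reset_password", "book_appointment"]

print(open_world_predict([3.0, 1.0, -2.0], INTENTS))     # two classes clear 0.5
print(open_world_predict([-8.0, -10.0, -10.0], INTENTS))  # none clears 0.5
```

Unlike Softmax, the sigmoid scores are not forced to sum to 1, so all of them can be low at the same time, and that is the signal used for rejection.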
This architecture is trained by minimising the cumulative negative log likelihood over all the classes. The paper uses an approach similar to anomaly detection to further fine tune the prediction. Instead of using a fixed value such as 0.5 as the threshold for the positive class, the authors introduce a variable threshold. An external parameter alpha controls this threshold. Alpha is the number of standard deviations away from the mean that would still qualify as a positive class. For further details, please refer to the paper.
Users of a chat system may not be aware of the limitations of the system. They will have queries that the chatbot was not designed to solve. In this article, we have explored the issues with the design and described how we improved the chatbot. In the new design, the chatbot recognises out of domain queries and can respond to them gracefully.
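The two ideas in this paragraph can be sketched as follows. Both functions are illustrative assumptions rather than a transcription of the paper: the loss sums a binary negative log likelihood over the sigmoid outputs for one example, and the threshold is assumed to take the form max(0.5, 1 - alpha * sigma), where sigma is the spread of a class's positive-example scores around 1.0 (computed here as a population standard deviation). The default alpha of 3.0 and the sample scores are also assumptions.

```python
import math

def one_vs_rest_nll(probabilities, target_index, eps=1e-12):
    """Summed binary negative log likelihood over all classes for one example."""
    loss = 0.0
    for i, p in enumerate(probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        # The true class is a positive example; every other class is a negative.
        loss -= math.log(p) if i == target_index else math.log(1.0 - p)
    return loss

def class_threshold(positive_scores, alpha=3.0, floor=0.5):
    """Assumed form: max(floor, 1 - alpha * sigma) for a single class.

    Mirroring each score p to 2 - p gives a sample symmetric about 1.0, so its
    standard deviation is the root mean squared distance of the scores from 1.0.
    """
    n = len(positive_scores)
    sigma = math.sqrt(sum((1.0 - p) ** 2 for p in positive_scores) / n)
    return max(floor, 1.0 - alpha * sigma)

# Hypothetical sigmoid scores on training examples of two classes.
tight_class = [0.99, 0.98, 0.97]   # scores cluster near 1.0
loose_class = [0.90, 0.70, 0.60]   # scores are spread out

print(one_vs_rest_nll([0.9, 0.1, 0.2], target_index=0))
print(class_threshold(tight_class))
print(class_threshold(loose_class))
```

Under these assumptions, a class whose scores cluster tightly near 1.0 gets a strict threshold, while a class with widely spread scores falls back to the floor of 0.5.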