Certainty matters in high-stakes AI
Machine learning has often been associated with pragmatism: the right solution is the one that works best in the real world and gives correct predictions more often than anything else. If a particular AI model used by a retailer for recommending items to purchase leads to more website clicks than any other model, or perhaps more additional revenue, it's the approach to go with. This holds for a great many AI applications offering valuable capabilities to organisations — everything from automated image tagging to routine speech recognition.
As AI starts to play an ever more prominent role in high-stakes decision-making, it's not enough just to be right most of the time. Model accuracy tells us something about how the model performs on average, but when we need to make a single decision with significant human or financial consequences, we need to know more. For example, consider a hypothetical medical imaging AI application designed to help doctors give patients a diagnosis; let's say the system gives the correct answer for 98% of patients. If you are the one sitting in the clinic, all that matters is whether it made the right prediction for you. If your medical images suggest a borderline diagnosis where it is hard for the AI system to be sure, would that not be more useful to know than simply that the model is wrong 2% of the time overall?
The same issues apply to business decisions. Suppose you are a VP in an R&D organisation, seeking to prioritise which product formulation to invest in. If an AI model predicts that a particular product formulation idea will be a success, should your R&D department invest in months of product development? If you can only progress one product, the average model performance is not so important; you need a measure of the uncertainty that the selected formulation will be a successful one. Indeed, you don't have to think for long to spot that these kinds of high-stakes decisions are everywhere, from facial recognition applications to autonomous vehicles. What we really need to know is how certain the system is that its prediction is correct on a case-by-case, decision-by-decision basis. Then we can evaluate the risks and make a decision.
Neural networks are the workhorse of many modern AI systems and typically give a numerical output when making a prediction. For example, to prioritise that product formulation above, we might train a classifier which would output a score for each formulation idea. It would be tempting to think of this score as a probability, where a score of 0.5 would mean a 50% chance the formulation will be developed into a successful product. Unfortunately, this isn't generally the case, and assuming that AI models' outputs are probabilities can lead to serious miscalculations of the risks in decision making.
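To make the gap between scores and probabilities concrete, here is a minimal sketch with entirely made-up numbers (the scores and outcomes below are illustrative assumptions, not data from any real model). An overconfident classifier reports an average score of about 0.90 while being right only 80% of the time:

```python
import numpy as np

# Hypothetical scores from a classifier for ten formulation ideas
# (assumed values for illustration only)
scores = np.array([0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.94, 0.90, 0.86])
# Actual outcomes: 1 = successful product, 0 = failure
outcomes = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

mean_score = scores.mean()  # the model "claims" ~0.90 confidence on average
accuracy = outcomes.mean()  # but it is right only 80% of the time

print(f"mean score: {mean_score:.2f}, accuracy: {accuracy:.2f}")
```

If you read each 0.9 score as a 90% probability of success, you would systematically underestimate the risk of backing a losing formulation.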
Unfortunately, neural networks are usually overconfident in their predictions, assigning higher scores than the true probabilities. To see why, it's useful to think about the sources of errors in AI model predictions. The most obvious situation is where there are errors or uncertainties in the underlying data on which the model is trained. These so-called aleatoric errors inevitably lead to uncertainties in the model predictions no matter how much data is collected, and for any choice of model. Less straightforward to capture are uncertainties which arise from the model itself: multiple neural networks can fit well to a set of training data and perform well during validation and testing, yet give different predictions when applied to a specific new decision. This leads to epistemic uncertainties which, by their very nature, are not automatically captured by a single neural network.
One solution could be simply to collect more data to suppress these model uncertainties and, if feasible, that's a good start. However, in many real-world scenarios, we need to contend with the idea that our AI model will be applied in situations which are not entirely represented by the training data the model has seen before — however much data we collect. This problem of out-of-distribution data can appear in all sorts of circumstances: it could be human error in supplying the wrong data, or a situation too rare to be seen during model training. Returning to the R&D example above, it may even be undesirable for new product formulations to be in-distribution, since successful products need a degree of novelty. In this case, it is crucial to understand how certain we can be about such predictions.
So what is the solution? Of the several approaches to tackling uncertainty in AI models, building an ensemble of different models is one of the simplest and most effective. Ensembles are well known to give improved performance for high-capacity models such as neural networks, but they also turn out to be an effective way to craft models which can give good estimates of their own uncertainty. Perhaps the simplest case is when the output of the system is a classification: thanks to some useful properties of the objective used to train the model, estimates of the model certainty can be found from the average of the scores of multiple neural networks. It's really that straightforward.
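The averaging step can be sketched in a few lines. In this toy version, each "ensemble member" is a stand-in for an independently trained network (here just a random linear map followed by softmax — an assumption for brevity; real members would be full networks trained from different random seeds):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for M independently trained classifiers over 3 classes
# (assumption: random linear maps in place of trained networks)
M, n_features, n_classes = 5, 4, 3
weights = [rng.normal(size=(n_features, n_classes)) for _ in range(M)]

def ensemble_predict(x):
    # Average the softmax scores of all members; the mean is the
    # ensemble's estimate of the class probabilities.
    probs = np.stack([softmax(x @ W) for W in weights])
    return probs.mean(axis=0)

x = rng.normal(size=n_features)
p = ensemble_predict(x)
print(p, p.sum())  # a valid probability vector summing to 1
```

Because each member's softmax output sums to one, the average does too, so the ensemble's output can be read directly as a (candidate) probability — subject to the calibration check discussed next.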
If we have a recipe for quantifying the uncertainty in predictions, why is this not done routinely? The recipe may be straightforward, but estimates of uncertainty still need to be validated in practice. (Analogously, recipes to train and select machine learning models are well understood, yet we know validating the performance of AI systems is essential.) To validate the performance, we need to look at the calibration error of the model: how well does the model know how certain it is in its predictions? A well-calibrated classifier will be correct 50% of the time when it claims to be 50% sure, 90% of the time when it claims to be 90% sure, 99% of the time when it claims to be 99% sure, and so on. To be confident in the uncertainties which AI models report, we need to see how good they are in practice. For example, when pooling all of the predictions with probability close to 0.8, do we find four fifths of them are correct? Crucially, does this hold for datasets different from those used in model training? This is usually what matters in the real world.
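One common way to summarise this check is the expected calibration error (ECE): bin predictions by their claimed confidence, then compare each bin's average confidence with its empirical accuracy. A minimal sketch (the toy data at the bottom is an illustrative assumption):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and weight each bin's
    |confidence - accuracy| gap by the fraction of points in it."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: claims 75% sure, right 3 times in 4
conf = np.array([0.75, 0.75, 0.75, 0.75])
corr = np.array([1, 1, 1, 0])
print(expected_calibration_error(conf, corr))  # → 0.0
```

An overconfident model shows up as bins where the average confidence exceeds the accuracy; crucially, this check should be run on held-out data, ideally including data unlike the training set.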
How universally do ensembles capture uncertainties in AI? I have found they work well across data modalities and neural network architectures — everything from structured data to graph models. However, they do require some effort. As already mentioned, additional data, ideally different from the data used to train the models, is needed to validate whether estimates of model uncertainty can be relied upon. Secondly, training multiple models instead of one potentially comes with significant expense and environmental cost, especially given the scale of some contemporary model architectures. Indeed, recent research has been directed at effective approaches to reap the benefits of model ensembles without the huge training costs, and even to capture uncertainty from a single neural network by considering the distances from new data to the model's training data.
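The distance-based idea can be illustrated with a very simple sketch: treat a new input as suspect when it lies far from every training example. Everything here is an assumption for illustration — real distance-aware methods typically measure distances in a network's learned feature space rather than raw input space:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for training examples in some feature space
# (assumption: in practice, features from a network's penultimate layer)
train_feats = rng.normal(size=(200, 8))

def min_distance_to_training(x):
    # Distance from a new point to its nearest training example;
    # larger values suggest out-of-distribution inputs whose
    # predictions deserve less trust.
    return np.linalg.norm(train_feats - x, axis=1).min()

in_dist = train_feats[0] + 0.01   # very close to a training point
far_out = train_feats[0] + 100.0  # far from all training data

print(min_distance_to_training(in_dist) < min_distance_to_training(far_out))  # → True
```

This gives a single network a crude "I haven't seen anything like this before" signal without the cost of training an ensemble, at the price of choosing a feature space and distance threshold sensibly.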
The story of techniques for developing models which can report their own uncertainty is still being written. But the importance of capturing AI uncertainties to inform risk-based decision making is well established, and will only grow as AI is used for more high-stakes decisions.