Is My Model Really Better?

Why ML models that look good on paper are not guaranteed to work well in production

These days, a typical ML research paper reads something like this:

We propose a new model architecture X. We show that X outperforms SOTA by Y%. We conclude X is better than the current SOTA. Our code is available online.

And that’s where academic research usually ends. However, from a production point of view, this is far from enough. There’s no guarantee that a model that looks good on paper actually makes a good production model.

In this post, we’ll dive into the additional challenges that we face when building models not just for research, but for production. We’ll learn:

why offline performance does not guarantee online performance,
why not all errors are the same,
why, in addition to modeling performance, latency and explainability matter, and
why you shouldn’t necessarily trust ML leaderboards.

Let’s get started.

Offline performance does not guarantee online performance

ML research is focused exclusively on offline performance: how does the model perform on a static, historical test set? In ML production, on the other hand, we care about online performance: how well is our business doing once the model is deployed and takes actions on real users in the real world?

For example,

in fraud detection, a suitable offline metric could be ROC-AUC, and a suitable online metric could be the chargeback loss from fraud transactions that were missed.
in search ranking, a suitable offline metric is NDCG, and a suitable online metric could be click-through rate.
in ads ranking, a suitable offline metric is also NDCG, but we may want to measure total ads revenue online.
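
To make the offline/online distinction concrete, here is a minimal Python sketch of one metric from each side: NDCG (normalized discounted cumulative gain) computed on a static list of relevance labels, and click-through rate computed from logged traffic. The relevance labels and click counts are made-up illustration values, and the linear-gain form of DCG used here is one common variant among several.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    rels = relevances[:k] if k is not None else relevances
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), etc.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

def click_through_rate(clicks, impressions):
    """Online metric: fraction of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0

# Offline: relevance labels of the results in the order the model ranked them
# (hypothetical labels on a 0-3 scale).
ranked_relevance = [3, 2, 3, 0, 1, 2]
offline_score = ndcg(ranked_relevance)

# Online: counts that would come from production logs (hypothetical numbers).
online_score = click_through_rate(clicks=42, impressions=1000)

print(f"offline NDCG: {offline_score:.3f}, online CTR: {online_score:.3f}")
```

The offline number can be computed as often as we like from a fixed dataset. The online number only exists once the model is serving real traffic.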

Several industrial ML research teams, such as Netflix and Booking.com, have found that improvements in offline model performance are no guarantee that the model actually works better online. Some models that have better offline performance even do worse online. A scatterplot of offline vs online performance from the Booking.com paper shows no correlation at all:

Better offline performance (x-axis) does not guarantee better online performance (y-axis). Source: Bernardi et al., KDD 2019.

Why does better offline performance not guarantee better online performance?

One of the reasons is that the offline metric is just a proxy for the business metric we actually want to optimize. The two may be correlated, but the correlation is not perfect. If a model overfits to the proxy metric, it can drift far away from the actual metric of interest. This can be particularly problematic for deep neural networks because of their large numbers of free parameters. The authors of the Netflix paper warn:

“If a deep-learning model is given the wrong problem to solve, it will solve it more accurately than less powerful models would.”

Improving the proxy metric can help. For example, researchers from YouTube found that optimizing the model for watch time works better than optimizing for clicks, because a model optimized for clicks ends up favoring click-bait videos with little value to the user.

Errors aren’t errors

Another problem with proxy metrics is that we’re making the implicit assumption that there’s no qualitative difference between the populations from which the errors originated. However, that assumption doesn’t always hold. Errors aren’t all the same.

For example, in e-commerce fraud detection, product velocity matters. False negatives for products with higher velocity (such as digital video games) are much more impactful because bad actors can create enormous damage in a short amount of time. So a model with fewer errors overall but more errors for high-velocity products may lead to much more bad debt and therefore worse business performance.
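
A small worked example shows how the error count and the business loss can disagree. The segment names, the cost per missed fraud case, and the false-negative counts below are all hypothetical numbers chosen for illustration, not figures from any real system.

```python
# Hypothetical cost of one missed fraud case (false negative), per segment.
COST_PER_MISS = {"low_velocity": 50.0, "high_velocity": 500.0}

# Hypothetical false-negative counts for two candidate models.
FALSE_NEGATIVES = {
    "model_a": {"low_velocity": 100, "high_velocity": 10},
    "model_b": {"low_velocity": 60, "high_velocity": 30},
}

def total_misses(counts):
    """Plain error count: every miss weighs the same."""
    return sum(counts.values())

def expected_loss(counts, cost_per_miss):
    """Business view: each miss is weighted by the cost of its segment."""
    return sum(n * cost_per_miss[segment] for segment, n in counts.items())

for name, counts in FALSE_NEGATIVES.items():
    misses = total_misses(counts)
    loss = expected_loss(counts, COST_PER_MISS)
    print(f"{name}: {misses} misses, loss {loss:.0f}")
```

With these made-up numbers, model_b makes fewer errors (90 versus 110) yet loses more money (18,000 versus 10,000), because more of its misses fall in the expensive high-velocity segment.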

Another domain in which errors have been shown to have qualitative differences is ads ranking. Researchers from Microsoft found that errors at low probabilities have a much bigger impact on the business metric of interest, ads revenue, than errors at high probabilities. That’s because it’s much worse to show an irrelevant ad to a user than to omit a relevant one: in the worst case, the user may get annoyed enough to leave.

Latency matters

ML research papers rarely discuss latency. After all, the test set is fixed, and we only need to evaluate it once so that we can report a number in the paper.

In production, however, the model needs to run as part of a service that’s used by millions or even billions of users daily. A critical metric in a user-facing application is single-request latency, which is the time it takes for a single user request to receive a response from the server.
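
As a rough sketch of how single-request latency could be measured for a model, the snippet below times each call individually and reports percentiles rather than a mean. The `toy_predict` function is a hypothetical stand-in for a real model, and the timing covers only the in-process call, not network or serialization overhead that a real service would add.

```python
import math
import time

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def measure_latencies_ms(predict, requests):
    """Time each request on its own and return the latencies in milliseconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        predict(request)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def toy_predict(features):
    """Hypothetical stand-in for a model: a tiny linear scorer."""
    return sum(w * x for w, x in zip((0.2, -0.1, 0.4), features))

latencies = measure_latencies_ms(toy_predict, [(1.0, 2.0, 3.0)] * 1000)
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
print(f"p50 = {p50:.4f} ms, p99 = {p99:.4f} ms")
```

Tail percentiles such as p99 are often more informative than the average, since a small share of slow requests can still affect many users at scale.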

How important is a model’s latency? In an experiment, researchers at Booking.com introduced synthetic latencies to their service in order to find out. The result? Latency has a statistically significant, negative correlation with user conversion rates. A 30% increase in latency costs about half a percentage point in conversion rates. “A relevant cost for our business”, the researchers write in their paper.

The authors report that based on this finding, Booking.com optimized their ML system for latency, using simple linear models built in-house with a minimal set of features.

Explainability matters

The difference between a model on paper and a model in production is that the model in production takes actions that impact real users. False positives and false negatives can lead to escalations. And if that happens, we had better have explanations for why our model made a mistake. Explainability is therefore another model property that matters in production but is rarely discussed in academic papers.

Generally, the more complex the model, the harder it is to explain its predictions. For a simple linear model, the model weights themselves encode what the model is looking at, and can give us some insights into why certain mistakes are happening. It becomes trickier for random forests or boosted trees, and even more so for deep neural networks. Tools such as SHAP can help with explaining decisions of complex models, but add latency. And, as we’ve seen earlier, latency matters a lot in user-facing applications.
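
To illustrate why a linear model is easy to explain, the sketch below breaks a single logistic-regression prediction into per-feature contributions (weight times feature value). The feature names, weights, bias, and input values are hypothetical, and the features are assumed to be standardized so that contributions are comparable.

```python
import math

# Hypothetical weights of a fitted logistic regression on standardized features.
WEIGHTS = {"order_amount": 0.8, "account_age": -1.2, "prior_chargebacks": 2.0}
BIAS = -1.0

def explain(features):
    """Return the predicted probability and per-feature contributions to the logit."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Largest absolute contribution first: these features drove the decision.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return probability, ranked

probability, ranked = explain(
    {"order_amount": 1.5, "account_age": -0.5, "prior_chargebacks": 1.0}
)
print(f"fraud probability: {probability:.3f}")
for name, contribution in ranked:
    print(f"  {name}: {contribution:+.2f}")
```

The explanation costs one multiplication per feature, so it adds almost nothing to serving time. That is the trade-off against post-hoc tools for more complex models.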

If explainability and latency are hard requirements, then a simple linear model may therefore be the best choice for production, even though the modeling performance may be inferior to more complex models.

Look-elsewhere effects and ML leaderboards

A common practice in ML research is to evaluate multiple models on the same test set in order to compare their performance. The problem with this practice is that due to random chance alone, some models are expected to outperform others, even if they are all just as good as each other.

In other words, as long as we try a large enough number of models on the test set, sooner or later we’re guaranteed to find a model that outperforms the model we’re trying to ‘beat’ just by chance alone. This is also known as the look-elsewhere effect.
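
A quick simulation makes this effect visible. In the sketch below, every model has exactly the same true accuracy, and each is scored on a simulated test set. The accuracy, test-set size, and number of candidates are arbitrary illustration values, and the errors of different models are assumed to be independent, which is a simplification (models scored on the same test set tend to make correlated errors).

```python
import random

def simulate_accuracy(true_accuracy, test_size, rng):
    """Observed accuracy of a model whose every prediction is right with prob. true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(test_size))
    return correct / test_size

rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_ACCURACY = 0.80    # identical for every model by construction
TEST_SIZE = 1000
NUM_CANDIDATES = 200

baseline = simulate_accuracy(TRUE_ACCURACY, TEST_SIZE, rng)
candidates = [
    simulate_accuracy(TRUE_ACCURACY, TEST_SIZE, rng) for _ in range(NUM_CANDIDATES)
]
best = max(candidates)
num_beating = sum(score > baseline for score in candidates)

print(f"baseline: {baseline:.3f}, best candidate: {best:.3f}")
print(f"{num_beating} of {NUM_CANDIDATES} candidates 'beat' the baseline")
```

Since all models are equally good by construction, any gap between the best candidate and the baseline in this simulation is pure sampling noise.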

Generally, the larger the difference in offline model performance, and the larger the test set, the more statistically meaningful the result. The exact statistical significance can be calculated with a statistical framework known as multiple hypothesis testing.
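
As a rough sketch of such a calculation, the snippet below compares two accuracies with a two-proportion z-test and applies a Bonferroni correction for the number of comparisons made. This is one simple choice among many, not necessarily the procedure used in any analysis mentioned in this post: it treats the two scores as independent samples, whereas a paired test such as McNemar's would be more appropriate for models scored on the same examples. All accuracies and test-set sizes are made-up values.

```python
import math

def two_proportion_p_value(acc_a, acc_b, n_a, n_b):
    """Two-sided p-value for the difference between two observed accuracies."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (acc_a - acc_b) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def is_significant(p_value, num_comparisons, alpha=0.05):
    """Bonferroni correction: divide the significance level by the number of comparisons."""
    return p_value < alpha / num_comparisons

# Same 2-point accuracy gap, two different test-set sizes.
p_small = two_proportion_p_value(0.82, 0.80, 1_000, 1_000)
p_large = two_proportion_p_value(0.82, 0.80, 100_000, 100_000)

print(f"n = 1,000:   p = {p_small:.3f}, significant: {is_significant(p_small, 1)}")
print(f"n = 100,000: p = {p_large:.2e}, significant: {is_significant(p_large, 200)}")
```

With these numbers, the two-point gap is not significant on 1,000 examples even for a single comparison, but it is significant on 100,000 examples even after correcting for 200 comparisons.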

Using this framework, AI researcher Lauren Oakden-Rayner calculated the statistical significance of the leaderboard from a Kaggle competition on a medical image segmentation problem. The result? The difference between model #1 and model #192 can be shown to be statistically significant, given the amount of data and the difference in the score. All models in between ranks 1–191? From a rigorous statistical point of view, we can’t conclude that any one of them is better than the other.

Going back to the question posed in the title: is my model really better? It might not be. It could just look better by chance.

Take-away: some tips for ML practitioners

Let me conclude with a handful of practical tips for your next ML project:

Take offline performance metrics with a grain of salt. View them as health checks, not as a guarantee for production performance. Instead, rely on randomized controlled trials to estimate your model’s performance in production.
In addition to offline performance, measure your model’s latency. Latency has a clear negative correlation with user experience, and you’ll want to avoid increasing it unless the model can offset the negative impact with substantial gains in performance. If explainability and latency are hard constraints, then a simple linear model may be the best approach.
Errors aren’t all the same. Study the impact of different types of errors from different populations on the business metric you’re trying to optimize. A model that has better AUC might make fewer but worse errors, resulting in worse overall performance.
Don’t rely on the top-performing model from an ML competition leaderboard. It might have just been lucky.
