When Not to Use Neural Networks
A guide to model selection and the nonexistence of right choices
A while ago, a friend asked me for some guidance on picking AI models. At the time, most of his AI education was centered around neural networks and their many variants. The impression he took from his courses was that neural networks (NNs) are powerful enough to solve anything and are therefore always the right choice, so why bother learning other techniques?
To be honest, I can relate. When I started my machine learning journey, I was also suspicious of “weaker models”: why learn fossilized techniques if we can jump straight to SVMs and ANNs? Fortunately, I know better now, and I hope my friend does too!
This made me think the question might be far more common than I ever gave it credit for. Therefore, I hope to shed some light on the role simpler models play in the grand scheme of things.
If you are a beginner data scientist, this is of particular interest to you, as picking the right model for each task is one of the most important skills you can develop. Moreover, model selection is one of those things that, when done wrong, can sabotage an entire venture all by itself. Finally, knowing when to use each model can also point out which models you should study next.
So let’s start with the basics.
All Models Are Wrong, but Some Are Useful
Right off the bat, this quote by George Box points us to a fundamental truth: all models are approximations. In other words, there are no right models or correct choices. Thus, by definition, we are always using the wrong models. However, that’s not to say they can’t be useful.
A useful model is one that brings value to its users: larger profits, lower costs, insights into problems, useful recommendations, a course of action, etc.
A beginner’s mistake is to think accuracy equals value. For instance, every Amazon page has dozens of product suggestions I will never buy. Accuracy-wise, the algorithm is close to 0% accurate. Yet, the few extra bucks it generates from its occasional hits add up to billions of dollars every year. Likewise, Medium sends me around 20 reading suggestions every day, out of which I might read one or two — enough to keep me visiting the website every single day. Accuracy is just one of the many tips of the usefulness iceberg.
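To put rough numbers on this, here is a minimal sketch of the expected-value arithmetic behind a low-accuracy recommender. The function name and every figure in it are made up for illustration; none of them come from Amazon, Medium, or any real system.

```python
def expected_profit(impressions, hit_rate, profit_per_hit):
    """Expected profit from showing `impressions` suggestions.

    hit_rate is the fraction of suggestions that lead to a purchase,
    which is also the recommender's accuracy in the naive sense.
    """
    return impressions * hit_rate * profit_per_hit


# Hypothetical scenario: only 1 in 100 suggestions is ever bought.
profit = expected_profit(impressions=1_000_000, hit_rate=0.01, profit_per_hit=2.0)
print(f"accuracy: 1%, expected profit: {profit:,.0f}")
```

Under these assumed numbers, a model that is wrong 99% of the time still pays for itself, because a miss costs close to nothing while a hit has a real payoff.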
In more specific and non-exhaustive terms, the usefulness of a model boils down to the following four properties:
- Interpretability: how much the model tells us about the problem it solves
- Explainability: how well it can explain the why behind its outputs
- Flexibility: how capable it is of describing complex subjects
- Complexity: how costly it is to be run and trained
For instance, when we inspect a decision tree, we learn how it solves the problem (interpretability), and we can trace which decisions brought an input to a particular output (explainability). The deeper a tree is, the more powerful (flexibility) and expensive (complexity) it will be.
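As a minimal sketch of that idea, the snippet below uses a tiny hand-written tree stored as nested dictionaries. The tree, its feature names (`income`, `debt_ratio`), and its thresholds are all hypothetical; a real tree would be learned from data, but it would be read in the same way.

```python
# Hypothetical, hand-written decision tree for a toy loan decision.
# Feature names and thresholds are made up for illustration only.
TREE = {
    "feature": "income",
    "threshold": 3000,
    "left": {"label": "refuse"},          # income <= 3000
    "right": {                            # income > 3000
        "feature": "debt_ratio",
        "threshold": 0.4,
        "left": {"label": "accept"},      # debt_ratio <= 0.4
        "right": {"label": "refuse"},     # debt_ratio > 0.4
    },
}


def predict_with_trace(tree, sample):
    """Return the predicted label and the list of decisions that led to it."""
    trace = []
    node = tree
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        value = sample[feature]
        if value <= threshold:
            trace.append(f"{feature} = {value} <= {threshold}")
            node = node["left"]
        else:
            trace.append(f"{feature} = {value} > {threshold}")
            node = node["right"]
    return node["label"], trace


def depth(tree):
    """Tree depth: a rough proxy for both flexibility and cost."""
    if "label" in tree:
        return 0
    return 1 + max(depth(tree["left"]), depth(tree["right"]))


label, trace = predict_with_trace(TREE, {"income": 4200, "debt_ratio": 0.55})
print(label)        # refuse
print(trace)        # ['income = 4200 > 3000', 'debt_ratio = 0.55 > 0.4']
print(depth(TREE))  # 2
```

Reading `TREE` itself is the interpretability part, and the returned `trace` is the explainability part: it is exactly the chain of decisions that produced this one output.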
A neural network, on the other hand, is quite opaque. Its weights carry little to no interpretive value and do not explain its reasoning. Nonetheless, a sufficiently large network can approximate almost any function, albeit at a considerable computational cost.
How valuable each of these properties is depends on the problem. Some problems are intuitive but hard to formalize, such as recognizing faces. These problems often benefit from flexible-but-complex solutions, such as neural networks. Others are unintuitive at first but can be solved through logical steps, such as accepting or refusing a loan request. For those problems, a simpler but more explainable model might be your best bet.
It should already be apparent that, while neural networks are great, they lack some critical properties: interpretability and explainability. Therefore, neural networks might be a poor fit for the task whenever these two properties are needed.
What is left is understanding when these properties are essential.
The Rule of Thumb
A professor of mine once beautifully put it like this: within AI, we use simple models to solve difficult problems and complex models to solve simple problems. So let us break this down piece by piece:
- Simple models are the classic machine learning methods, such as linear classifiers, decision trees, k-nearest neighbors, etc. As a general rule, models you could implement yourself in an afternoon without much googling or math.
- Difficult problems are all those tasks we humans need training to solve and some time to think about before we can come up with a good answer. For instance, evaluating the value of a house, reviewing a loan proposal, deciding on a course of action for a patient, etc.
- Complex models are all the heavily numeric methods, such as SVMs and neural networks, or, more broadly, nearly all kernel and gradient-based methods. You are very unlikely to code a useful version of these yourself, at least not without some major googling.
- Simple problems are all the intuitive tasks we solve each day. For example, when you see someone you know, you don’t stop to think: you instantly recognize (1) that it is a person, (2) who this person is, and (3) their facial expression. In fact, you are reading this text right now without giving any real thought to the shape of the letters or how each syllable corresponds to the sounds you internally hear. What is more, you don’t even know how the millions of neurons inside you connect so that you can understand all this.
To rephrase my professor’s words, what is easy for the AI is usually hard for us. Likewise, what is dead simple for us is utterly complex for machines to solve.
Following this reasoning, on intuitive problems, we can readily validate an algorithm’s outputs just by looking at them. We don’t need an explanation or insights. In most cases, demanding an explanation from the algorithm is more of a debugging tool than something the final application needs.
On the other hand, on difficult-for-human problems, inspecting the why behind an algorithm’s output is one of the most powerful ways to check whether its reasoning is reliable. Moreover, this reasoning is often useful for the problem itself, such as showing a user why their loan application was refused.
For a more elaborate example, say you trained a model to predict house prices. In theory, this model should be grounded in reality. For instance, within a similar neighborhood, larger houses should be more expensive than smaller ones. The same should hold for the number of rooms and floors. If you cannot interpret the model’s weights, you can never be truly sure it will always behave appropriately. Therefore, picking a simpler but auditable model is your safest bet.
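Here is a minimal sketch of what such an audit could look like for a linear price model. The weights, intercept, and feature names are hypothetical stand-ins for an already-fitted model; I deliberately gave `floors` a negative weight so the check has something to catch.

```python
# Hypothetical weights of an already-fitted linear price model (made-up values).
WEIGHTS = {"area_m2": 1800.0, "rooms": 6500.0, "floors": -1200.0}
INTERCEPT = 40_000.0

# Domain expectation: within a neighborhood, more of each should not lower the price.
EXPECTED_NON_NEGATIVE = ["area_m2", "rooms", "floors"]


def predict_price(house):
    """Linear model: intercept plus the weighted sum of the features."""
    return INTERCEPT + sum(WEIGHTS[name] * house[name] for name in WEIGHTS)


def audit_weights(weights, expected_non_negative):
    """Return the features whose weight contradicts the domain expectation."""
    return [name for name in expected_non_negative if weights[name] < 0]


violations = audit_weights(WEIGHTS, EXPECTED_NON_NEGATIVE)
print(violations)  # ['floors']

# The violation is not just cosmetic: adding a floor lowers the predicted price.
one_floor = {"area_m2": 80, "rooms": 3, "floors": 1}
two_floors = {"area_m2": 80, "rooms": 3, "floors": 2}
print(predict_price(two_floors) - predict_price(one_floor))  # -1200.0
```

With a linear model, the whole audit is a loop over a handful of numbers. There is no equally direct check for the weights of a neural network.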
Meanwhile, when judging tweets as positive, neutral, or negative, more straightforward techniques such as rating words from -1 to +1 might be too simple, failing to handle elements such as irony, sarcasm, and typos. Moreover, when angry, people can create some interesting new words that no vocabulary will ever have an entry for. In such scenarios, an utterly opaque model, such as one built on word embeddings or a full-blown Transformer, might be the most effective and reliable approach.
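The word-rating approach is easy to sketch, and so are its failure modes. The mini-lexicon, the neutral margin, and the example sentences below are all invented for illustration; a real lexicon would hold thousands of entries but would fail in the same ways.

```python
import re

# Hypothetical mini-lexicon rating words from -1 to +1.
LEXICON = {
    "great": 1.0, "love": 0.8, "good": 0.5,
    "bad": -0.5, "hate": -0.8, "awful": -1.0,
}


def lexicon_score(text):
    """Average the ratings of the known words; unknown words are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def sentiment(text, margin=0.1):
    """Map the average score to a label, with a small neutral band around zero."""
    score = lexicon_score(text)
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"


print(sentiment("I love this phone, great battery"))                   # positive
print(sentiment("Oh great, my flight got cancelled again. Love it."))  # positive (sarcasm missed)
print(sentiment("awfull service"))                                     # neutral (typo missed)
print(sentiment("This update is an absolute garbagefire"))             # neutral (invented word missed)
```

Every line of this scorer is auditable, which is exactly why its mistakes are so easy to spot. The catch is that fixing them one rule at a time does not scale.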
Finally, as a second rule of thumb, the more powerful a model is, the less interpretable and explainable it tends to be. For instance, while decision trees are among the most auditable models, random forests are much less so. Likewise, while k-nearest neighbor models are pretty understandable, Gaussian processes are significantly more opaque. Another example is the linear SVM compared to its kernel variants, which are pretty hard to wrap your head around.
On the same topic, composing models usually hurts interpretability and explainability. For instance, a decision tree trained on PCA features is considerably less readable than a decision tree trained on the raw features. The same goes for ensembles.
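To see why the PCA case reads worse, recall that each principal component is a weighted mix of all the original features. The sketch below prints the same kind of split in both settings. The loadings, feature names, and thresholds are hypothetical, and the centering step PCA applies before projecting is left out for brevity.

```python
# Hypothetical loadings of the first principal component (made-up values).
PC1_LOADINGS = {"area_m2": 0.62, "rooms": 0.55, "age_years": -0.56}


def split_on_raw_feature(name, threshold):
    """A tree split on a raw feature reads as a plain comparison."""
    return f"{name} <= {threshold}"


def split_on_component(loadings, threshold):
    """Spell out a split on a principal component in terms of the original features."""
    terms = " ".join(f"{weight:+.2f}*{name}" for name, weight in loadings.items())
    return f"{terms} <= {threshold}"


print(split_on_raw_feature("area_m2", 120))
# area_m2 <= 120
print(split_on_component(PC1_LOADINGS, 0.3))
# +0.62*area_m2 +0.55*rooms -0.56*age_years <= 0.3
```

The first rule is something you can say out loud to a client. The second involves every feature at once, and a tree of any depth would stack several of these on top of each other.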
The Complexity Scale
For convenience, here is an (opinionated) list of where the most common methods fall on this scale, from interpretable to complex.
Highly interpretable/explainable models: all linear models, logistic regression, decision trees, k-nearest neighbors, etc. These models are considerably more “algorithmic” than “mathematical.”
Intermediate models: boosting and forest models (e.g., XGBoost), naive Bayes, Gaussian processes, etc. Generally, methods in this category are either more powerful versions of simpler models (e.g., decision forests versus decision trees) or are significantly more mathematical.
Advanced models: kernel SVMs, neural networks, Bayesian models, etc. Methods in this category are often the only viable approaches for various problems and/or are custom-tuned to possess specific features (such as uncertainty estimates). Moreover, they can vary in complexity (e.g., Transformers versus MLPs).
As with everything in life, there are always scenarios where the obvious approach isn’t what you need. So here is a non-exhaustive list of scenarios in which you might swap a simple model for a complex one or vice versa.
Computer Vision tasks: when dealing with images and videos, we usually have no other option than CNNs and Vision Transformers (ViTs). However, the formulation you use can significantly impact the usefulness of your model. For instance, segmentation/detection models are somewhat more explainable than image classifiers (after all, they tell you where the object is). It can also be worthwhile to explore mixed approaches, such as using a CNN to parse a scene and a simpler model to reason over its results.
When Accuracy is a Priority: back to the house price predictor, it might be more attractive for a company to get the best estimates at all costs than to have a stronger sense of why. The same might apply to an asset value predictor in a day-trading setting. In such scenarios, a complex neural network might be just the overkill you need (just make sure you always remember it can all backfire someday).
When Explainability is a Priority: in some settings, such as a medical pipeline, having an explainable model might be a legal requirement, not an option. In such cases, your best bet is to use the most complex approach that still meets those requirements. A good suggestion here is an auto-tuned XGBoost model.
Using Add-Ons: while I said neural networks are neither explainable nor interpretable, some literature is dedicated to fixing such issues. For instance, one can use SHAP values to derive some meaningful insights into the inner workings of models. For some problems, these hybrid approaches can be sufficient. However, do not expect miracles from them.
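To give a feel for what SHAP values are built on, here is a minimal sketch that computes exact Shapley values for a tiny black-box function. This is not the SHAP library or its API: it is the textbook Shapley formula, with the simplifying assumption that a "missing" feature is replaced by a single baseline value. The function `black_box`, the input, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x`, relative to `baseline`.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n model calls per feature, so this only works for a few features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    result = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {name}) - value(without))
        result[name] = total
    return result


def black_box(f):
    """Hypothetical opaque model with an interaction between b and c."""
    return 2.0 * f["a"] + f["b"] * f["c"]


x = {"a": 1.0, "b": 2.0, "c": 3.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = shapley_values(black_box, x, baseline)
print(phi)  # roughly a: 2, b: 3, c: 3

# The attributions add up to the gap between the prediction and the baseline.
print(sum(phi.values()), black_box(x) - black_box(baseline))
```

Notice that the interaction term `b * c` is split evenly between `b` and `c`. The numbers are a fair accounting of the output, but they still say nothing about how the model arrived at it, which is why I would not expect miracles.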
So far, we have considered what has recently been dubbed “a model-centric approach”: given a problem, we look for the best model to solve it. While this is how many tasks are approached, in the real world, the problem itself can also be changed. For instance, within a company, one can often reframe a problem so that it becomes more manageable. Similarly, when trying to increase a model’s accuracy, one can just fetch more data points instead of trying everything the state of the art has to offer (which is A LOT).
Therefore, I am ending this piece by reminding you that the real world offers a lot more possibilities than what you might have seen in your university degree. The real world is data-centric. While picking the right model is incredibly important, always make sure you have excellent data as well.
Thanks for reading 🙂