There are growing concerns about the scientific rigor of Machine Learning research. In a 2017 talk at the NIPS conference, Ali Rahimi (then at Google) and Ben Recht argued that ML has become alchemy, referring to the fact that practitioners use methods that work well in practice but are poorly understood on a theoretical level. Likewise, Francois Chollet, known as the author of the Keras Deep Learning library, likens today’s ML practice to ‘cargo cults’, where people rely on ‘folklore and magic spells’.
Alchemy, cargo cults, magic spells. This is noteworthy criticism of a field that has seen such rapid progress, with increasingly widespread real-world applications. And it is exactly this widespread use that worries Rahimi and Recht:
“If you’re building photo sharing services, alchemy is fine. But we’re now building systems that govern health care and our participation in civil debate. I would like to live in a world whose systems are built on rigorous, reliable, verifiable knowledge, and not on alchemy.”
ML is an empirical field: we simply don’t have a theory that would explain why certain methods work and others don’t, and it is not even clear if such a theory will ever exist. But the lack of theory by itself is actually not the main problem. Even in an empirical field of research, progress can be made in a scientifically rigorous way.
I argue that the crucial distinction between science and alchemy starts with the role that practitioners assign to the scientific hypothesis.
ML practitioners face an overwhelming amount of complexity, from dataset sampling and cleaning to feature engineering to model selection and hyper-parameter tuning. Tinkering with these components, and seeing what works best, usually on the test set, has become the norm.
But tinkering alone does not make a science. The fundamental difference is the role of the scientific hypothesis: scientists first formulate a hypothesis, and then design an experiment to test that hypothesis. The hypothesis is then either rejected or accepted, and either way, we have produced new knowledge. The scientific method is agnostic about the outcome of the experiment.
Tinkering, on the other hand, is driven not by hypotheses but by ‘gut feelings’. That’s OK if the goal is merely to explore a phenomenon. But things get dangerous when tinkering masquerades as science through HARKing (Hypothesizing After the Results are Known), i.e. formulating a hypothesis that fits the results after the results are known.
HARKing is misleading because it fools not only the researcher but the entire community. In the worst case, the researcher might perform a large number of experiments with different variations of an algorithm, pick the version that achieves the desired result, which in practice means that it beats the latest state-of-the-art benchmark, and apply HARKing to justify this choice. This is also colloquially known as SOTA-hacking.
Of course, the more random experiments are run, the more likely it is that one of them beats any given benchmark by chance alone: this is also known as the look-elsewhere effect. Even worse, SOTA-hacking takes up resources that could be spent on actual innovation. Facebook engineer Mark Saroufim writes in ‘Machine Learning: The Great Stagnation’:
With State Of The Art (SOTA) Chasing we’ve rewarded and lauded incremental researchers as innovators, increased their budgets so they can do even more incremental research parallelized over as many employees or graduate students that report to them.
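To make the look-elsewhere effect mentioned above concrete, here is a minimal sketch. It assumes that each experiment is independent and has the same small chance `p_single` of beating the benchmark purely by luck; both the function names and the 5% figure are illustrative assumptions, not numbers from any study.

```python
import random


def prob_at_least_one_win(p_single: float, n_runs: int) -> float:
    """Chance that at least one of n_runs independent experiments
    beats the benchmark by luck alone: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_single) ** n_runs


def simulate(p_single: float, n_runs: int, n_trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo check of the closed-form expression above."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # One "research project": n_runs lucky-or-not experiments.
        if any(rng.random() < p_single for _ in range(n_runs)):
            wins += 1
    return wins / n_trials


if __name__ == "__main__":
    for n in (1, 10, 50):
        exact = prob_at_least_one_win(0.05, n)
        approx = simulate(0.05, n)
        print(f"{n:>3} runs: exact={exact:.3f} simulated={approx:.3f}")
```

Under these assumptions, a 5% chance of a lucky win per run grows to roughly 40% after 10 runs and over 90% after 50 runs, which is why an unreported number of attempts makes a single benchmark win hard to interpret.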
Formulating a scientific hypothesis before running experiments is the best protection mechanism against HARKing and SOTA-hacking. In their paper ‘HARK side of Deep Learning’, data scientist Oguzhan Gencoglu and colleagues even advocate for a ‘result-blind’ submission process for ML research papers: let the scientists submit their scientific hypothesis along with the experimental design. After acceptance they can then go ahead and perform the experiment, under the condition that they have to publish the result, no matter whether it confirmed or ruled out the hypothesis. It’s a drastic and probably unrealistic solution, but it would surely eliminate SOTA-hacking.
What ML can learn from Physics
As ML research evolves, I believe that it could benefit from borrowing from Physics. One of the fundamental ideas in Physics is to consider a small toy problem that is easier to solve and can give valuable insights into the larger, more complex problem.
This is not to say that these Physics-style experiments are not being done, but they are in the minority. Notable examples, in the context of NLP, are studies that reveal the sensitivity of the famous BERT language model to metonymies, polysemic words, or simply the order of the input sequence. For example, the latter study found that, when trained on GLUE benchmark tasks, BERT is relatively robust against word order, indicating that most of the signal does not come from context but instead from other cues such as keywords.
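The word-order experiment described above can be sketched as a small probe: score a model on the original sentences, score it again on the same sentences with their words shuffled, and compare. Everything below is a hypothetical toy, not the cited study's setup and not BERT: `keyword_model` and `DATA` are made-up stand-ins that only show the shape of the experiment.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (sentence, label)


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def accuracy(model: Callable[[str], str], data: List[Example]) -> float:
    return sum(model(text) == label for text, label in data) / len(data)


def word_order_probe(model: Callable[[str], str], data: List[Example],
                     seed: int = 0) -> Tuple[float, float]:
    """Return (accuracy on original text, accuracy on shuffled text)."""
    rng = random.Random(seed)
    shuffled = [(shuffle_words(text, rng), label) for text, label in data]
    return accuracy(model, data), accuracy(model, shuffled)


# Hypothetical model that only looks for a keyword and ignores word order.
def keyword_model(text: str) -> str:
    return "pos" if "great" in text.split() else "neg"


# Made-up toy data, for illustration only.
DATA: List[Example] = [
    ("the film was great", "pos"),
    ("a great cast and a great script", "pos"),
    ("the plot was dull", "neg"),
    ("not worth the ticket price", "neg"),
]

if __name__ == "__main__":
    original, shuffled = word_order_probe(keyword_model, DATA)
    print(f"original={original:.2f} shuffled={shuffled:.2f}")
```

A model that relies purely on keywords scores the same on both versions, whereas a model that actually uses context would be expected to lose accuracy on the shuffled text. The size of that gap is the quantity such a probe measures.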
In addition to toy problems, another powerful empirical methodology is that of ablation studies, the practice of deliberately leaving out one component of the solution at a time to distinguish the crucial components from ‘bells and whistles’ with no actual impact. In the context of NLP, a good example is the well-known 2017 paper ‘Attention is all you need’, which showed that recurrence in language models becomes redundant in the presence of an attention mechanism. Another good example is the 2017 paper ‘On the Role of Text Preprocessing in Neural Network Architectures’, which showed that, with the exception of lowercasing, common text preprocessing techniques (text cleaning, stemming, lemmatizing) provide no measurable improvement in the downstream ML model performance.
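The leave-one-out logic of an ablation study can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_evaluate` is a hypothetical stand-in for "train and score the pipeline with these components switched on", and the component names and score contributions are invented for the example.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation_study(
    components: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """For each component, return the drop in score when only that
    component is removed from the full system."""
    full = frozenset(components)
    baseline = evaluate(full)
    return {name: baseline - evaluate(full - {name}) for name in sorted(full)}


# Hypothetical scoring function: in real use this would retrain and
# evaluate the model on held-out data. The numbers are made up.
def toy_evaluate(active: FrozenSet[str]) -> float:
    contributions = {
        "core_feature": 0.10,     # assumed to matter
        "extra_layer": 0.0,       # assumed 'bells and whistles'
        "fancy_scheduler": 0.0,   # assumed 'bells and whistles'
    }
    return 0.70 + sum(contributions[name] for name in active)


if __name__ == "__main__":
    drops = ablation_study(
        ["core_feature", "extra_layer", "fancy_scheduler"], toy_evaluate
    )
    for name, drop in drops.items():
        print(f"without {name}: score drops by {drop:.2f}")
```

Components whose removal leaves the score essentially unchanged are candidates for deletion. In practice each evaluation would be repeated over several random seeds so that a small drop can be told apart from noise.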
From alchemy to science
A lot of today’s ML practice feels like alchemy. But as the examples above show, even in the absence of theory, certain experiments can be done to gain deeper insights into ML’s inner workings and place the field on a more rigorous scientific footing. In particular, here are my three recommendations to fellow ML practitioners:
- Be explicit about your hypothesis prior to any experiment. Avoid HARKing and the temptations of SOTA-hacking.
- Be creative: think of particular toy problems that can either confirm or rule out a hypothesis that has been made (implicitly or explicitly) within the community.
- Use ablation studies to identify the crucial pieces and eliminate the ‘bells and whistles’ in your ML solution.
Lastly, my hope is that as ML research evolves it will move away from its current focus on breaking performance benchmarks and towards more fundamental understanding. Science, after all, is the pursuit of knowledge, not wins. I agree (and end) with Rahimi and Recht, who write:
Think about how many experiments you’ve run in the past year to crack a dataset for sport, or to see if a technique would give you a boost. Now think about the experiments you ran to help you find an explanation for a puzzling phenomenon you observed. We do a lot of the former. We could use a lot more of the latter.