Of Research and Rogues

Ian Hellström | 30 April 2021 | 4 min read

When was the last time you read an article in the press on machine learning operations?

Interest in machine learning is at peak levels, and the latest advances are reported far and wide.

These reports only scratch the surface of the research that’s going on at academic institutions and research labs in industry. But what about making machine learning work in the real world? And ensuring machine learning research is up to snuff and can be replicated? Not so much.

What’s the Problem?

The focus on research as opposed to research and operations cons novice data scientists and business leaders into believing machine learning is all about research and not about operations:

[M]odel development is the most […] celebrated work. Google (2021)

Research is very important work. Research is great for generating buzz in the media, and the researchers deserve their fifteen minutes of fame. Unfortunately, that buzz does not help companies put any models into production. It may even give the impression that everyone else is miles ahead of the pack. If the competition appears to build self-driving killer transformers that can write Shakespearean sonnets, compose a fugue in the style of Bach, paint a Monetesque landscape, and find a cure for cancer, while you have yet to figure out whether to use stack or concat on lists of tensors, you might as well give up now.

ML competitions focus almost exclusively on performance tinkering without providing much, if any, insight into the theoretical or even empirical reasons why a technique works and what its consequences for production use could be. It is obvious why: operations is never a concern. And that is a concern.

If at least the research were easily reproducible, it would be less of an issue. Alas, benchmark data sets and standardized problems facilitate the relentless pursuit of novelty:

[T]his hyperfocus on novel methods leads to a scourge of papers that report marginal or incremental improvements on benchmark data sets and exhibit flawed scholarship as researchers race to top the leaderboard. Hannah Kerner (2020)

The Kaggle leaderboard and its ilk are nothing but pissing contests in the same way as LeetCode and HackerRank are: they pander to the same crowd, with potentially negative consequences for diversity.

Replication is not rewarded, and that is a problem for both research and operations. If the results of a study cannot be replicated, its methods, conclusions, and value must be questioned. And if a model cannot be re-run with the same outcome each time, it has no value to operations either.
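To make "re-run with the same outcome each time" concrete, here is a minimal sketch of a seeded training run. It is an illustration only: the `train` function, the synthetic data, and the tiny one-parameter model are hypothetical and use nothing beyond Python's standard library. The idea is simply that every source of randomness (data, initialization, sampling order) is drawn from one explicitly seeded generator.

```python
import random


def train(seed, steps=200, lr=0.05):
    """Fit y = w * x by SGD on synthetic data; all randomness derives from `seed`."""
    rng = random.Random(seed)  # local generator, so no hidden global state

    # Hypothetical synthetic data: true slope 3.0 plus Gaussian noise.
    data = []
    for _ in range(50):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

    w = rng.uniform(-1.0, 1.0)  # random initialization
    for _ in range(steps):
        x, y = rng.choice(data)  # random sampling order
        w -= lr * 2.0 * (w * x - y) * x  # gradient step on the squared error
    return w


first = train(seed=42)
second = train(seed=42)

# Re-running with the same seed reproduces the exact same weight.
print(first == second)
```

Real pipelines have more sources of nondeterminism than this toy (library versions, hardware, parallelism, data snapshots), so a fixed seed is a necessary starting point rather than a guarantee.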

That’s not a novel problem though. A lot of the benchmark problems in machine learning research have little to do with the real world, as Jaime Carbonell observed as early as 1992. Twenty years later, it was the same story:

Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society. Kiri Wagstaff (2012)

Little has changed in the nearly ten years since. With only 15% of studies in machine learning publishing their code, it is doubtful much can be replicated. Transparency appears to be troublesome in both research and industry when it comes to data and machine learning.

A lack of transparency prevents new AI models and techniques from being properly assessed for robustness, bias, and safety. MIT Technology Review (2020)

Robustness, bias, and safety are mostly concerns for operations. While there is some research into robustness (e.g. adversarial attacks) as well as bias and fairness, safety appears to be on the radar of very few researchers outside specific industries and ethics.

Beyond Research and Rogues

There exists a cohort of rogue data scientists in enterprises who think machine learning is fiddling with model performance metrics outside the context of the business, because that is what most research focuses on and what the media highlight. Reproducibility, a core tenet of the scientific method, appears to have been banned from data “science”, and it’s about time it made a return to instil rigour in a discipline that sorely needs it for both research and operations.

Rigour, reproducibility, requirements specifications, and robustness of algorithms must be central to industrial and high-stakes applications of machine learning, where safety is an absolute must.

Unless the invisible tasks that are essential to the success of machine learning, such as data collection and management, labelling, (code) reviews, reproducing research, ensuring robustness of algorithms in operations, and so on, are treated with more respect and are properly rewarded, we may very well discover that

AI’s greatest contributions to society, […] could and should ultimately come in domains like automated scientific discovery. […] But to get there we need to make sure that the field as a whole doesn’t first get stuck in a local minimum. Gary Marcus (2018)