How to Spot a Rogue Data Scientist

The last few years have seen a massive influx of “data scientists” who complete a few MOOCs, participate in a Kaggle competition or two, and think that experience is representative of life as a data scientist in the enterprise. If that mindset persists, and they show no desire to change it, they are what I call rogue data scientists.

A rogue data scientist is anyone in industry who is paid to deliver results through data and machine learning, but has little incentive to operationalize their work and little interest in doing so. In fact, they mostly work for their own intellectual enrichment.

It’s important to note that I am not talking about researchers in academia or industry who typically work towards publications and not the release of production-grade software. Few organizations do pure research, although large tech companies may give a different impression. My focus is not on real data scientists either, who understand that data, development, and deployment are all key to success, not merely one of the three. A rogue fails to grasp that data science in industry is applied, not theoretical. And applied implies messy, because the real world is messy.

What about data scientists who merely create visualizations or write ad hoc queries to answer various questions from the business? Sorry, but these are data analysts with an inflated job title. By the way, congratulations on finding a company that pays a data scientist’s salary in return for an analyst’s job.

A rogue data scientist is anyone who ticks more than a few boxes in each of the groups listed below: academic mindset and résumé padding. Behaviour in line with only one or two of the traits is by no means worthy of the ‘rogue’ epithet. Please be mindful of that when calling data scientists rogue, even behind their backs. They may not like it.

Academic Mindset

ML is Modelling

Academia, online courses, and data science competitions paint a picture of well-labelled data sets and machine learning as being restricted to mere modelling. The model is seen as the most prestigious part.

Outside of the lab, the model itself takes up only a fraction of any production machine learning system. Data science is therefore less about novel research and machine learning models, and way more about scrubbing and shaping data, otherwise known as dumpster diving for diamonds in the data. But of course that’s not exhilarating enough to a rogue:

AI practitioners […] view data as ‘operations’. Google (2021)

Novelty above All

The value of simple solutions to the business is irrelevant to a rogue: complexity of any model is indicative of sheer brain power and having the largest brain in the room counts. Originality always beats actual results. Paradoxically, the rogue data scientist applies mostly recipes lifted from various online courses, bootcamps, and blogs; development of novel algorithms is rarer than a full-stack rogue data scientist.

The pursuit of novelty can lead to incessant algorithm tuning to improve performance only marginally in the lab. It’s rare for rogue data scientists to work on improving the data quality or fall back on data augmentation techniques to boost the model’s performance. After all, that’s janitorial work for the data engineers.

Seeking novelty as a way of life is also behind entry-level data scientists who claim to be proficient in many different frameworks, because they went through the MNIST tutorial for each framework. MNIST is the Hello, World! of machine learning. Don’t be fooled by it!

Scripts as Solutions

Few in academia write high-quality software. After all, it’s not the goal, which is why it’s rarely taught outside of perhaps computer science.

Not having been taught is not an excuse, once you’re in industry. There are plenty of resources to learn best practices from, including other team members on a cross-functional product team. Reading books, studying open-source code, trying it out yourself, working through problems, and receiving feedback from colleagues are how most engineers perfect their craft.

Pull requests from fellow data scientists are virtually absent, and engineers are never asked for their opinion. After all, they are merely the hired help to operationalize their grand ideas. Rogue data scientists therefore miss out on a key means of learning on the job: code reviews.

A rogue’s code mostly lives inside brittle notebooks with zero automation: copy-paste is as close to automation as a rogue gets. If a rogue produces a script instead of a notebook, don’t be surprised to find a single file with 2,500 lines of code without much structure or documentation. Import statements are everywhere, and the code is a top-to-bottom tangle of a train wreck.

Notebooks are executed by running cells in whichever order makes the exceptions disappear. The way scripts are ‘tested’ is by running them once, maybe twice. On the rogue’s laptop, that is. If it works on these singular occasions, the code is deemed correct. Time to move on.
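For contrast, here is a minimal sketch of what the alternative might look like: imports at the top, logic in a small documented function, and checks that anyone can re-run. The function name `normalize` and the sample values are hypothetical, made up purely for illustration.

```python
"""Scale a list of numbers to the range [0, 1]."""
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Min-max scale values; a constant list maps to all zeros."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_normalize() -> None:
    # Checks that can be re-run on any machine, not just once on a laptop.
    assert normalize([]) == []
    assert normalize([3.0, 3.0]) == [0.0, 0.0]
    assert normalize([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]


if __name__ == "__main__":
    test_normalize()
    print("all checks passed")
```

Nothing here is sophisticated, which is the point: the edge cases are written down, and the checks run the same way every time rather than in whichever order makes the exceptions disappear.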

Production as Somebody Else’s Problem

A rogue’s opinion of programming is that it is a menial task or a nuisance. The model is the only exception, although it is typically trained on a laptop with an extracted sample of data stored in CSV. Any outside interference with their masterpiece is seen as an affront, which is why they prefer to hand over notebooks to machine learning engineers who have to deal with the operationalization.

What does that mean in practice? The engineers look at the code and the dearth of documentation, and decide to redo it from scratch because fixing it would be more cumbersome. Little wonder so many machine learning initiatives fail.

A real data scientist cares enough about operations not to act like the roommate everyone else has to clean up after; the rogue often behaves as though everyone ought to be grateful for mopping up their mess.

The image of the data scientist as an all-round data guru is a mirage in the case of a rogue. The lack of interest in basics of software engineering and data infrastructure means they do not have any idea how data is stored or structured, or how to best access it. If not downright wrong, SQL queries of the rogue are highly inefficient.
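One common flavour of inefficiency is pulling every row out of the database and aggregating it in application code. The sketch below is only an illustration: the `orders` table, its columns, and the values are hypothetical, and an in-memory SQLite database stands in for whatever store is actually in use.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 2.5), ("bob", 1.0)],
)

# Inefficient: fetch every row, then aggregate in Python.
totals_in_python = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_in_python[customer] = totals_in_python.get(customer, 0.0) + amount

# Better: let the database aggregate and return one row per customer.
totals_in_sql = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

# Both approaches agree; only the amount of data moved differs.
assert totals_in_python == totals_in_sql
conn.close()
```

With four rows the difference is invisible. With a few hundred million it is the difference between a query and an incident.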

Focus on Fun

The rogue kind sees everything beyond what they consider fun as not their job. Consequently, everyone else is called upon whenever results promised by the rogue data scientists do not live up to expectations. And they never do.

If you’re not in research, having models work in the lab but not in the real world is a clear sign you’re nowhere near done. A real data scientist understands that.

PhD Optional

A PhD is often a good indicator of a rogue, although it is by no means a necessary or even sufficient condition. It has been my experience that the academic mindset and excessive focus on research rather than utility often outlive brief stints in academia. While such attitudes are understandable right after leaving academia, they must not continue indefinitely.

LaTeX

The desire to write any documentation in LaTeX is also a key characteristic. Whether that can easily be integrated, maintained, or generated automatically does not matter. It is also of no concern if few on the team are actually proficient in its use.

Instead of writing good documentation, or any documentation at all, the rogue data scientist prefers to fuss over proper typesetting. Unless you’re in research, properly typeset mathematical equations are not essential.

There are of course exceptions: if you develop novel algorithms from scratch, LaTeX can be perfectly sensible, although perhaps equations ought to be embedded in the already available documentation, without having to set up a separate process to build and publish it internally.

Inflated Ego

Basic solutions that work, such as heuristics or linear/logistic regression, are used only as a last resort, when results are demanded by the higher-ups or to get rid of an undesirable project. Such simple but effective solutions are frequently dismissed as being silly. In fact, it’s not uncommon to hear rogue data scientists denounce most problems in the industry they are passing through as easily solvable.

Yet all they produce are notebooks that fail to reproduce the results on anyone else’s machine or on the full data set; fewer than a quarter of all notebooks can be re-run and only one in twenty-five reproduces the original results upon re-execution. Still, that is somebody else’s problem, as we have already seen. They appear undeterred by the fact that the lack of reproducibility makes data science nothing but alchemy.
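Reproducibility has many ingredients (pinned dependencies, versioned data, a documented environment), but one of the cheapest is making randomness explicit. The following is a minimal sketch using only the standard library; the function `sample_split` and its parameters are hypothetical names chosen for illustration, not taken from any particular framework.

```python
import random
from typing import List, Tuple


def sample_split(
    n_rows: int, test_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Return (train, test) row indices; the same seed gives the same split."""
    rng = random.Random(seed)  # local generator, no hidden global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * test_fraction)
    return indices[cut:], indices[:cut]


# Two runs with the same seed produce identical splits.
first = sample_split(100, 0.2, seed=42)
second = sample_split(100, 0.2, seed=42)
assert first == second
```

Passing the seed as an argument, rather than relying on whatever state the notebook kernel happens to be in, means the split can be recreated on somebody else's machine.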

Résumé Padding

Tech as a Bucket List

The aim of choosing technologies is résumé padding rather than suitability based on requirements. Ideally each project uses a different framework or a new collection of libraries. A rogue data scientist views technology as a bucket list, and they want to tick off every item on it, as that makes them more valuable. Unfortunately, few outside of machine learning can see through the scam.

The more people use a single framework inside a company, the more people can help out in case of questions or issues. Most standard (deep learning) frameworks offer more or less the same functionality anyway. It’s therefore much better from a productivity and maintainability perspective to settle on a common technology for the majority of use cases.

That does not mean each initiative must use the same set of libraries and there is no room for exceptions or experimentation. Far from it. There is just little use in using multiple roughly equivalent technologies to do the same task, unless you’re trying to embellish your skills section on LinkedIn.

If ever rogue data scientists are asked to account for the lack of tangible results, they tend to blame the technology they are forced to work with: they are doing the best they can with the shitty stack they have. When technology is not in their crosshairs, it is of course the engineers’ fault: data engineers are too slow to implement data pipelines, and machine learning engineers butcher their brilliant models.

Cherry-Picking Tasks and Initiatives

Cherry-picking tasks to work on or lobbying to be put on high-profile initiatives is not beneath a rogue data scientist. If they can outsource routine assignments to engineers or offload repetitive jobs onto interns, they will, as long as they remain in control of any credit that may be due.

Pick-and-Mix Platforms

Rogue data scientists can have a detrimental impact on the success of machine learning as a whole. A single rogue data scientist in a team of only data scientists is often enough to turn the entire team rogue. Shadow IT is typically a consequence of a rogue data science organization that picks and mixes whatever it pleases with little regard for the integration into the overall enterprise architecture.

Since data scientists typically do not manage the infrastructure, the result is a fragile platform consisting of barely functional and scarcely understood technologies, lots of duplication, and siloed knowledge within the team, because certain rogues prefer one stack whereas others prefer another.

GPUs Everywhere

GPUs can undoubtedly speed up computations in ML. It is doubtful that everything must run on GPUs, though, as the costs often outweigh any speed-ups. Nevertheless, rogue data scientists want high-end GPUs in their laptops to ensure they can list that on their CVs. It’s of course cheaper to share GPUs in public clouds or on-premises clusters, but a rogue data scientist tends to be territorial.

What to Do?

At this point, you may wonder what, if anything, you can do if you have spotted a rogue on your team. There are two distinct cases: rogue by default and rogue by choice.

Rogue by Default

Data scientists may be rogue-ish because they are fresh out of college or they have just exchanged academia for industry. People gravitate towards what they are already familiar with, so if you have only ever built fancy models on your own laptop based on relatively clean data sets, choosing whichever technology you wanted, while writing research notes in LaTeX, it’s only natural you end up looking like a rogue.

The idea of a rogue data scientist as the norm has also been fuelled by excessive media hype. Back in 2012, Harvard Business Review declared data scientist the sexiest job of the 21st century, only to call it tedious a mere one-and-a-half years later. Perhaps HBR was a bit premature in their proclamation, but the damage had already been done: it set a generation of data scientists up for a career as rogues.

If the rogue simply does not know any better, it is best to embed the person in a cross-functional product team of data engineers, data scientists, machine learning engineers, software engineers, and a product manager. They will soon pick up best practices from their immediate neighbours: data and machine learning engineers. Similarly, their unique abilities will rub off on the people around them, which is great for everyone involved.

Inside a product team, they also cannot double as product or, worse, project managers, for which they do not automatically have the requisite qualifications. Why companies sometimes believe data scientists have these abilities anyway remains a mystery.

Rogue by Choice

If the rogue among you has been out of a research environment for quite a while but has no desire to adapt, they are rogue by choice. Such people pose the greatest risk, as they can contaminate an entire data science organization when left unchecked by management.

It is obvious that their managers ought to explain the basic rules of their employment: remuneration in return for AI-infused products that benefit the business, not a set of brittle prototypes to fatten their CVs. Of course that may not be sufficient to convince rogues to change their ways, in which case: good luck!