The Case for Industrial Data Science
Ian Hellström | 1 February 2016 | 14 min read
It has – perhaps somewhat prematurely – been called the sexiest job of the twenty-first century, but whether you buy into the Big Data hype or not, data science is here to stay.
The available literature, the majority of courses in both the virtual and real world, and the media all project the image of the data science ‘artiste’: a data bohemian who lives among free, like-minded spirits in lofty surroundings, who receives sacks of money in exchange for genuine works of art created with any possible ‘cool’ tool that flutters by in whatever direction the wind is blowing that day.
The reality for many in the field is quite different. Corporations rarely grant anyone unfettered access to all data, and similarly they are not willing to try and buy every new tool that hits the market, simply to satisfy someone’s curiosity. Furthermore, industrial data science has requirements that are much stricter than what is commonly taught in programmes around the world, and it’s time to make the case for industrial data science.
Data science is not new. It’s simply the modern buzzword for what used to be called business analytics, statistics, machine learning, predictive analytics, and so on. In books alone, the appearance of the term ‘Data Science’ has increased steadily since the 1980s.
Sure, you could argue that data science encompasses more than merely statistics or simply machine learning, and I wholeheartedly agree. However, data science is the convergence of modern incarnations of the aforementioned fields.
Some of the aspects of what I would like to call industrial data science – as opposed to ordinary data science, that is, the sanitized version they teach in courses at universities and MOOCs, where ‘data cleansing’ and ‘communication skills’ are stressed as being important but then quickly glossed over, as they are hard to examine and even harder to teach – are similar to the four types of data integration challenges that I identified on these pages almost two years ago: technical, project management, people, and data governance.
To anyone who has read the tetralogy on data integration, it may seem that the case for industrial data science is very similar and only differs in the specifics of the technical aspects. In fact, the technical aspects almost pin down some architectural decisions, which is not the case in data integration, where the technology stack is pretty much open and the number of options to choose from considerable. Moreover, I want to make the case for industrial data science because I firmly believe that it is different from, and perhaps more constrained than, run-of-the-mill data science.
Industrial Processes: Quality and Safety
Production quality, or rather its absence in the form of failures, and product safety are the main drivers of industrial data science.
Quality, cost, and delivery (QCD) are central to the operation of many industrial corporations. In manufacturing, for instance, quality is not just about providing customers with high-quality goods, it’s about ensuring that products are safe to use. This is especially true with safety-critical components, such as airbag control units, jet engines, and cardiopulmonary bypass pumps.
In manufacturing, failures nowadays happen in the so-called six sigma region, meaning that a few parts in a million are defective and need to be scrapped. Because many of the components that go into the products are expensive, real-time early-warning systems are needed. Sure, it’s expensive to scrap parts in production but it is even more costly when failures occur in the field, where they can potentially have deadly consequences. Recent examples of such field failures include GM’s faulty ignition switch and Takata’s airbag inflator.
To ensure that any analytics model captures failures correctly and in a timely manner, every single relevant event needs to be captured and processed in the correct order. For complex event processing (CEP) or real-time scoring engines, this means that all pertinent events must be guaranteed to be delivered at least once, preferably exactly once. At-most-once delivery is simply not good enough.
With for instance sentiment analysis based on tweets, your model does not really suffer if an occasional event (i.e. tweet) is delivered in the wrong order or is dropped altogether. The simplest mode, at-most-once delivery, is already good enough. As long as most tweets arrive the model will be fine. This is not the case in industrial settings.
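To make the delivery-semantics requirement concrete, here is a minimal Python sketch (the event shape, ids, and payloads are made up) of how an at-least-once stream can be processed with effectively-exactly-once semantics: duplicates are dropped by event id, and out-of-order events are buffered until the gap in the sequence closes.

```python
# Sketch: effectively-exactly-once processing on top of at-least-once
# delivery, with in-order flushing of out-of-order events.
import heapq

class OrderedDedupConsumer:
    def __init__(self, process):
        self.process = process    # downstream scoring function
        self.seen = set()         # event ids already handled (dedup)
        self.next_seq = 0         # next sequence number expected
        self.buffer = []          # min-heap of early (out-of-order) events

    def on_event(self, event_id, seq, payload):
        if event_id in self.seen:  # duplicate redelivery: drop it
            return
        self.seen.add(event_id)
        heapq.heappush(self.buffer, (seq, event_id, payload))
        # Flush every event that is now in order.
        while self.buffer and self.buffer[0][0] == self.next_seq:
            _, _, p = heapq.heappop(self.buffer)
            self.process(p)
            self.next_seq += 1

processed = []
c = OrderedDedupConsumer(processed.append)
c.on_event("a", 0, "weld-ok")
c.on_event("c", 2, "torque-low")   # arrives early: buffered
c.on_event("a", 0, "weld-ok")      # duplicate: ignored
c.on_event("b", 1, "weld-ok")      # gap filled: flushes seq 1 and 2
# processed == ["weld-ok", "weld-ok", "torque-low"]
```

In a production system the `seen` set would of course be bounded (e.g. by a time window) and persisted, which is precisely the bookkeeping that engines with exactly-once guarantees take off your hands.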
In the Hadoop ecosystem, this means that Apache Storm with Trident and Apache Flink are both options as stream processing engines, as they support exactly-once delivery and can handle out-of-order events. Samza is possible but does not yet support exactly-once delivery. Spark Streaming is not really an option: it can process streams with exactly-once guarantees, but that presupposes Apache Kafka, and it still processes micro-batches rather than single events. Graphical end-to-end solutions that satisfy the requirements include Apache NiFi and StreamSets.
Accuracy: Trade-Offs and Level
When introducing a model with tuned hyperparameters or just a different, more accurate model, the trade-off between false positives and false negatives often leaves some wiggle room. This trade-off is typically examined by means of the ROC curve or even a collection of confusion matrices for different cut-offs.
In models that are supposed to detect failures, additional false negatives cannot be permitted. A false negative is a part that ought to have been scrapped but was flagged as OK and therefore moved on to the next processing step. Additional false positives are not ideal either, because extra resources are needed to analyse the potential failure, but at least they do not represent an intolerable risk.
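One way to make this asymmetry explicit is to choose the cut-off by expected cost rather than by accuracy. A minimal sketch, with purely illustrative scores and cost figures:

```python
# Sketch: pick a classifier cut-off by weighting false negatives
# (escaped defects) far more heavily than false positives
# (needless re-inspection), instead of maximizing accuracy.

def expected_cost(scores, labels, threshold, c_fn, c_fp):
    """Cost of a cut-off; label 1 marks a defective part."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return c_fn * fn + c_fp * fp

# Toy defect scores; a field failure is assumed 100x costlier than a review.
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0,    0,    1,    0,    1,    1   ]
best = min((expected_cost(scores, labels, t, c_fn=100, c_fp=1), t)
           for t in [0.1, 0.3, 0.5, 0.7, 0.9])
print(best)  # (1, 0.3): the cut-off 0.3 minimizes the expected cost
```

Note that the lowest-cost threshold here tolerates a false positive in order to catch every defect, which is exactly the trade-off described above.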
Consequently, redundancy is also a critical ingredient: there must not be a single point of failure in the data flow. In case the real-time scoring or CEP engine is unavailable, there either has to be a backup that kicks in immediately or the model must revert to a default solution that is good enough.
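As a sketch of such a fallback (the service and rule names are hypothetical), the scoring call can degrade to a conservative default rule whenever the primary engine is unreachable:

```python
# Sketch: a scoring call that falls back to a conservative default rule
# when the real-time engine is unavailable, so the line never runs
# without any check at all.

def score_with_fallback(part, engine_score, default_rule, timeout_s=0.2):
    try:
        return engine_score(part, timeout=timeout_s)  # primary CEP/scoring engine
    except (TimeoutError, ConnectionError):
        # Degraded mode: a simple rule that errs on the side of
        # flagging parts for manual review.
        return default_rule(part)

def default_rule(part):
    return "review" if part["torque"] > 9.0 else "ok"

def broken_engine(part, timeout):
    raise ConnectionError("engine down")

print(score_with_fallback({"torque": 9.5}, broken_engine, default_rule))  # review
```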
The level of accuracy is also quite different in ordinary data science. The Netflix Prize awarded $1m to the data science team that beat Netflix’s own user rating predictions by 10% (in terms of the RMSE). In the parts-per-million (ppm) arena, such improvements in accuracy for mature analytical models are almost unheard of. Whereas a huge effort to improve the accuracy by a fraction of a per cent may not be warranted in Google AdWords campaigns or Netflix recommendations, such an improvement may save loads of cash in industry.
Architecture: Streaming and Batch I/O
We have already looked at streaming data, but typically corporations also have plenty of data at rest inside data warehouses and ERP systems. Master data is a classical example but it is not the only data that can be found inside RDBMSs. In fact, some data may not be available as streams, for example, read-only machine logs or certain data sets from a manufacturing execution system (MES). These systems haven’t historically been designed to satisfy the current appetite for data.
Consequently, a Lambda architecture is not ideal: batch and stream data are handled separately, which means that the development and maintenance efforts are doubled. Moreover, there is no direct link between the batched data and the streams, so that the CEP engine cannot make use of what’s in, say, the master data. The reverse is also true: the batch storage does not know anything about the logic of the CEP engine, so that information is lost as soon as it’s had its fifteen minutes of fame on a real-time dashboard.
The Kappa architecture at least enables the link between the CEP engine and the batch storage, by means of the message bus or enterprise service bus that feeds both. Nevertheless, analytical models still need to be built for both streaming and batch data.
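The core idea behind Kappa can be sketched in a few lines of Python: a single replayable log is the source of truth, and the very same handler serves both the real-time path and batch recomputation, so the two cannot drift apart. (The log and event shapes below are, of course, toy stand-ins.)

```python
# Sketch of the Kappa idea: one append-only log, one processing
# function, serving both live events and batch replays.

class Log:
    def __init__(self):
        self.events = []           # the durable, replayable message bus
        self.subscribers = []

    def publish(self, event):
        self.events.append(event)
        for fn in self.subscribers:
            fn(event)              # real-time path

    def replay(self, fn):
        for event in self.events:  # batch path: the same function, re-run
            fn(event)

def count_defects(state):
    def handle(event):
        if event["defect"]:
            state["defects"] += 1
    return handle

log = Log()
live = {"defects": 0}
log.subscribers.append(count_defects(live))
log.publish({"part": 1, "defect": False})
log.publish({"part": 2, "defect": True})

# Later, rebuild the view from scratch with the identical logic:
rebuilt = {"defects": 0}
log.replay(count_defects(rebuilt))
assert live == rebuilt  # one code path for streaming and batch
```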
As an alternative, there is the Zeta architecture, which unifies the technology stack. It comprises:
- a distributed file system (e.g. HDFS)
- real-time data storage (e.g. HBase)
- a computation/execution engine (e.g. MapReduce, Spark, or Drill)
- a deployment/container management system (à la Docker)
- a solution architecture
- enterprise applications
The idea of schema-on-read (i.e. HBase and NoSQL) or even schema-on-the-fly (e.g. Apache Drill) is a utopia in many corporations. A considerable cleansing and integration effort is required. Whether semantic modelling provides a way out of the effort remains to be seen.
In addition, SaaS is not an option: the network traffic would immediately become a bottleneck. More importantly though, few industrial companies want to send their data outside of their corporate firewall. As such, only on-premises solutions are possible.
In themselves, these considerations do not limit data science in any significant way. However, not everything is plug-and-play. Some of the most popular open-source libraries in R and Python have almost zero support for MapReduce or Spark. Yes, there is MLlib, but it is not nearly as complete as what Python and R have to offer.
What is even more critical is that some open-source solutions come with limited or no (commercial) support, which is not something many large corporations are too happy about. Related to that is that these solutions rarely offer the whole experience: a reusable model repository with solid built-in documentation capabilities that comes with one-click deployment seems to be mainly available in commercial suites, such as SPSS or SAS. Alpine, RapidMiner, or KNIME all offer a part of the solution but not the whole package. This DIY mentality is fine and perhaps even encouraged in start-ups and academia but many industrial companies have a hard time with what some perceive as a willy-nilly attitude towards data and software.
Continuous Integration and/or DevOps
This leads us naturally to continuous integration (CI) issues. In production environments, continuous integration and DevOps are common practice. Analytical models that have been developed for one facility or plant may have to be rolled out to (and maintained at) other locations. These may have similar requirements, but that is not necessarily the case; not every company builds its manufacturing plants according to Intel’s Copy Exactly philosophy. Hence, a platform and deployment standards are required. The platform ought to be capable of simple re-training of the models and of running several versions in parallel for comparison purposes.
As much as I like R, CI and DevOps are nearly impossible with it. Sure, you can export the models to PMML and run those in an execution environment that is the same everywhere, but dependencies can nevertheless throw a spanner in the works, although checkpoint by Revolution Analytics is an important step towards deployable and maintainable R code. Apart from that, R is limited by the RAM of the machine; with SparkR you can leverage Spark’s capabilities from within R, but the trade-off is that the package source is not CRAN but rather Spark (with MLlib).
You could argue that data scientists and data engineers account for the distinction: the former come up with the ‘creative’ ideas and the latter do the professional implementation. However, as data science matures within companies I doubt that these companies want to stay in single-project mode forever, especially since that can create an atmosphere in which the data scientists become the frivolous artists and the data engineers their impresarios. I’ll have more to say on business processes in a moment, so bear with me on this one.
Impact on Production
Closely related to the architecture is the impact a solution has on production. Ideally, the impact in terms of performance degradation is negligible. Many legacy systems were never designed to be accessed from the outside continuously or even at all. It is therefore possible that these systems are affected negatively by connecting modern systems that voraciously consume data. In the case of manufacturing execution systems, the performance impact can be disastrous. Any increase in the cycle time of products, simply because the MES has to deal with an additional load, is unacceptable. I do know that this objection is typically flung into the room to fend off what may be perceived as data raids, but it is a fair concern that needs to be addressed. Industrial data science solutions need to be minimally invasive yet operate with maximum (positive) impact.
Even if you buy into the data scientist/engineer divide, data scientists cannot access data with impunity. They too have to be mindful of performance considerations when connecting to live systems. This of course is just another argument that the line between data scientist and data engineer in an industrial setting is not very clear. In fact, industrial data scientists have to be able to deal with such concerns independently.
Data science in most its guises is currently still in what I call single-project mode. A problem with potential is identified, a project is initiated, and if the project is successful, a permanent solution is developed. That works for many modern organizations, but in an industrial setting, where much of the work has been automated or is at least following a script, that won’t be a long-term solution. Repeatable, well-documented business processes are paramount. Hence, at some point industrial companies need to mature from project-focussed to process-oriented data science. A framework such as CRISP-DM or SEMMA may help when doing projects, as it at least ensures that there is no variation in the way projects are done. More importantly, it allows projects to be handed over to operations with proper documentation of each step; I personally recommend CRISP-DM, as it covers the whole data science life cycle, from business and data understanding to evaluation and deployment.
Such framework standardization is basically a baby step towards the industrialization of data science within an organization. I am well aware that some data scientists do not like the idea of formalizing the entire process, but that is a crucial component of industrial data science.
On a somewhat related note, there is sometimes an impedance mismatch between the data and business teams with regard to how ‘projects’ are done. Some organizations have not yet embraced agile and stick with classical project management, which is perfectly fine. However, data science teams often work in an agile fashion. In itself that is not a problem, but it can cause duplication of work: all the tasks have to be managed in JIRA, or a similar tool, and typically an aggregation of tasks needs to be maintained in PM software, for example Microsoft Project. One solution that I have discovered that can make life easier in similar situations is JIRA Portfolio, which sits atop JIRA and provides a multi-project view replete with Gantt charts, resource management, forecasting, and tracking. If you can make the tedious tasks somewhat more bearable to the team, it’s probably worth the investment.
Process and Domain Knowledge
Domain knowledge is obviously crucial to data science. It’s what makes feature engineering and selection so much more effective. However, an industrial data scientist must know not just about the processes and related control technologies (SPC/APC and run-to-run), but basic knowledge about physics, chemistry, engineering, biology, pharmacology, or medicine may be required in certain situations too. On top of that, in order to be able to talk with process engineers effectively, knowledge of automatic optical inspection (AOI) technologies, defect engineering, FMEA/FTA, Six Sigma, drug approval processes, and many more is a must.
Because of this complexity, analytical models can easily have tens of thousands of variables prior to feature engineering, especially when data scientists perform analyses across the value chain. A single product may be assembled from tens to hundreds of individual parts, each of which typically has a few dozen process steps. Where ICs and/or on-board sensors are involved, the number of steps for the IC alone is typically a few hundred, each with single measurement values, time series, and images of (microscopic) structures.
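A back-of-envelope multiplication, with purely illustrative numbers, shows how quickly the raw variable count explodes:

```python
# Back-of-envelope (illustrative numbers only): why value-chain models
# start with tens of thousands of raw variables before any feature
# engineering has taken place.
parts_per_product = 100  # individual components in the assembly
steps_per_part    = 30   # process steps per component
readings_per_step = 5    # measurements logged at each step
raw_variables = parts_per_product * steps_per_part * readings_per_step
print(raw_variables)  # 15000
```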
Although Facebook may be able to experiment on its users with impunity and while Uber can launch a service that appears to be against the law in some countries, industrial companies have to abide by strict laws to ensure the safety of their products. You may not believe that based on the recent Volkswagen scandal though. Nevertheless, regulatory compliance is not an option, and familiarity with applicable laws is a must for industrial data scientists.
Since we’re on the topic of processes, I firmly believe that industrial data scientists need to embrace process mining, which is a fairly young discipline. It allows process flows to be discovered (i.e. learned) from raw events, which in turn allows deviations from the norm or a pre-defined standard (i.e. process compliance) to be identified. Similarly, graph analyses with Neo4j or GraphX will become more commonplace in industry, as graphs are a natural framework to analyse value chains and study the genealogy of product components.
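The first step of most process-mining algorithms, the directly-follows graph, fits in a few lines of Python; the event log and reference model below are toy examples:

```python
# Sketch: discover a directly-follows graph from a raw event log and
# flag transitions outside a reference model as compliance deviations.
from collections import Counter

def directly_follows(event_log):
    """event_log: {case_id: [activity, activity, ...]} in time order."""
    edges = Counter()
    for trace in event_log.values():
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges

log = {
    "lot-1": ["etch", "clean", "inspect", "ship"],
    "lot-2": ["etch", "clean", "inspect", "ship"],
    "lot-3": ["etch", "inspect", "ship"],  # skipped cleaning!
}
reference = {("etch", "clean"), ("clean", "inspect"), ("inspect", "ship")}
edges = directly_follows(log)
deviations = {e for e in edges if e not in reference}
print(deviations)  # {('etch', 'inspect')}: lot-3 skipped the cleaning step
```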
The Minefield of Corporate Politics
Communication is often mentioned as the key skill that separates good data scientists from the truly great. It’s true, but that is but one side to the story. The other is that great communication skills still don’t get you anywhere when the organization resists, and there will be plenty of cases where a data scientist’s persistence is futile. Sometimes your hands are bound. An example that springs to mind is employee monitoring, which quickly lands you in hot water with the law or trade unions. No matter how smoothly you talk or how great your idea really is, no sometimes really means just that: no.
That leads me to Tim Gunn and Dr Phil. Yes, really, it does. In Season 4 of Project Runway, Tim Gunn used an analogy for designers and bad fashion that applies to data scientists and their ideas and/or models too. You may think what you have is the best invention since sliced bread, because bla bla and bla, but you may have been in the monkey house for too long: you don’t enter (like everyone else) with a fresh nose to be repelled by the stench, for you have already grown accustomed to it. In such cases, pushing for your idea aggressively may hurt your future chances of success. ‘You teach people how to treat you,’ as Dr Phil quips. Sometimes you need to let go, no matter how great you think the idea is. Sometimes it’s really not that great, or it’s simply not worth agitating people over or burning bridges. In a data-centric start-up you may be the rock star of the company, but in an industrial behemoth you’re just one of an army and sometimes merely cannon fodder. I know that does not sound very kind, but I think it’s time to defuse the time bomb of the almost baseless optimism surrounding data science and especially Big Data. Everything in moderation, please.
Anyway, dealing with corporate IT is definitely not ‘sexy’, to come back to the epithet. In fact, it can be downright frustrating. I’ve already mentioned the potential problem when interfacing with mission-critical systems, such as an MES, but beyond technical reasons, there may be political or even circumstantial ones. People responsible for different aspects (e.g. IT vs quality control) frequently have different agendas. Issues can and occasionally must be escalated to the executive level. It’s a special power that you should not wield too often, except perhaps in dysfunctional organizations where it’s the only weapon people yield to.
Expectations are also quite varied in industrial settings. Management would like reports on the performance aggregated across facilities, perhaps enhanced with analytical insights on the sales situation across the globe. Engineers need analytical capabilities baked into historical reports with full drill-down capabilities, but only for their area. The shop floor needs real-time visualizations with early-warning systems that provide timely alerts on phones, watches or smart goggles. Maintenance engineers need real-time data as well as predictions for maintenance events. People in operations need most of the information the data science models provide, so that such maintenance events can be scheduled and their impact simulated before a final decision is made. Logistics needs predicted factory output and information from sales and marketing to forecast demand.
Managing expectations is of course not unique to industrial data scientists, but the need to integrate with a plethora of systems is not that common outside of industry. Not even data monsters like Google have such a huge variety of interlocking gears as most industrial companies. As an example of the complicated interplay of different divisions and systems, let’s look at predictive maintenance. Predictive maintenance does not just affect the team responsible for fixing the machines. It also impacts the shop floor and operations and thus logistics and eventually customers. Small maintenance events generally disappear in the overall variability. However, some events have the potential to create a (temporary) bottleneck situation that can significantly decrease the output for an extended period of time.
I have mentioned the figure of 50-80% before, when I talked about the percentage of time taken for data cleansing and preparation. What is almost never mentioned, and what I also failed to note, yet is definitely part of the life of an industrial data scientist, is that waiting for the corporate cogs to align may take even longer than the time a data scientist needs to whip the data into shape. This is rarely something an industrial data scientist can control, but it is a reality that few are prepared for.
Without data governance, a data scientist is in a unique position to see the skeletons in the closet. Along with that comes the risk of becoming the designated data janitor when the data quality issues turn out to be more pressing than the analytics itself. That risk is higher in an industrial giant than in a lean start-up, since ‘strategic re-evaluations’ are more common in the former than in the latter.
Depending on your perspective, industrial data science is either a specialization or an extension of regular data science. In some ways, it adds additional constraints, mainly organizational and architectural ones. In other ways, it requires a data scientist to be even more of an all-rounder with broader domain knowledge and technical expertise. Nevertheless, industrial data science exists, and it’s time we developed realistic programmes for it, because the people aren’t going to train themselves.
An abridged version of this article appeared on Datafloq.