Lean Data and Machine Learning Operations
Ian Hellström | 13 December 2019 | 17 min read
Lean Data and Machine Learning Operations (D/MLOps) is the adoption of the ‘lean’ philosophy from manufacturing. Its aim is to continuously improve the operation of data and machine learning pipelines.
While the foundations for lean manufacturing had been laid by Frederick Taylor and Henry Ford in the early twentieth century, the principles were not widely adopted until after the Second World War. Japanese businesses, such as Toyota, were the vanguard. What does lean manufacturing have to do with data science and engineering, you ask? There is still a lot we can learn from best practices across industries, especially with regard to a structured approach to continuous improvement in operations.
Lean software development, which itself borrows ideas from lean manufacturing, shares many values with the agile paradigm; both champion an iterative, customer-oriented approach to software development, in which it is preferred to deliver quickly and delay (irreversible) decisions as much as possible. Rather than considering all data and machine learning (D/ML) development activities, I shall zoom in on data and machine learning operations (D/MLOps). With that restriction in mind, we can lift most concepts without too many mental gymnastics, if we see the assembly line as the process by which we create data or insights: a D/ML pipeline. The (intermediate) product is therefore the data itself.
‘Kaizen’, or continuous improvement, is at the heart of many lean ideas. It aims to better a business through standardized processes that are designed to eliminate waste.
While many software professionals balk at the introduction of ‘process’, a certain amount of it is usually warranted, and in the interest of any business. In fact, in the guise of CI/CD there usually already is some standardized (deployment) process in place. In operations, standardization and automation are key to efficiency.
Many methods from lean and six sigma are analytical and thus come naturally to developers. For instance, DMAIC (define-measure-analyse-improve-control) and PDCA (plan-do-check-act) are essentially standardized recipes for experimentation, debugging, and monitoring. The ‘five whys’ are another standard interrogative technique used in continuous improvement scenarios to identify the root cause of a problem. I already spoke about that in a post on product management principles.
To avoid preaching to the choir, I shall instead focus on different kinds of ‘waste’, the 5S methodology, a few related concepts from lean manufacturing, aesthetics, and martial arts, and how they relate to lean D/MLOps.
Waste conjures up images of smoke billowing out of industrial chimneys or toxic sludge seeping into rivers. Such waste is visible, but there are different kinds that are harder to spot. Lean identifies different types of waste. Waste is essentially anything that does not add value for the customer.
Within the D/MLOps practice, there is an opportunity to apply similar principles, especially since large pipelines that run on a schedule can squander resources without that waste being obvious.
Constraints or impediments that cause waste are called ‘muda’. Muda shows itself in the various guises I shall discuss below.
No value is added when data is moved, either digitally across the network or physically on hard or flash drives, although the latter is fairly rare as an integral part of a data or machine learning pipeline. If data is physically in transit, the same concerns as in manufacturing apply: there is a risk of damage. Transportation is a form of waste, although it may be unavoidable.
In architectures that span the globe, data can live in different data centres, which means that network transfer can incur significant costs and delays. Such transfers ought to be reduced as much as possible, as they add no value to pipelines; they are a burden on the network and the company purse.
If the data needs to be in multiple geographical areas because of disaster recovery or compliance, D/ML pipelines need to ensure they operate on the data in the appropriate region(s). Note that I did not include performance as a reason for multi-regional data replication (e.g. with CDNs), as such strategies are part of the application, not the D/ML pipelines or operations thereof.
Code bloat wastes network resources too. Modern distributed data processing frameworks, whether they run on on-premise hardware (e.g. Hadoop’s MapReduce, Spark, Flink) or in the cloud (e.g. AWS, Azure, GCP, IBM), move the code towards the data. D/ML pipeline code typically lives inside container images that can easily grow to hundreds of megabytes. Many such images end up inside company container registries, from where they are pulled on demand by workflow schedulers or orchestration tools. A strict interpretation of muda would view the network transport of the code (as well as the additional resources needed to create such images in development) as waste, but it is a trade-off many in the industry accept. Still, it is good practice to ensure the base images do not contain unnecessary dependencies. One way is to use distro-less base images (e.g. with ephemeral containers for debugging) or a minification application. Another is to ensure there is not one base image for the entire company, especially when different programming languages are employed: a Python-based ML pipeline does not need to rely on a base image with the entire JDK included.
The lack of an operations manual can cause undue friction. Such a manual lists the issues that occurred by type or how they manifested, resolutions, any relevant code, as well as when they happened. That way, on-call colleagues can resolve any recurrent issue as quickly as possible by not having to thumb through a chronological list of everything that ever happened. At the same time, it’s easy to routinely review whether any problems come back time and again. Issues that pop up regularly are ideal candidates for additional automation too, or even better: code and/or architectural improvements.
Any gap in automation often shows itself as multiple similar scripts that exist within the team. This prevents a standardized way to deal with routine tasks from being developed and therefore increases friction. If a script performs a routine task, it’s best to make it available to the entire team. After all, if it’s useful for one person, it’s most likely useful to more than one too.
Another possible impediment in D/MLOps can be access to systems. I have personally fought with setups in which access to internal tools is only possible after logging in through several systems (e.g. VPN > RDP > Citrix > SSH). This may be done with the best of intentions and all in the name of security, but it’s a constant drain on productivity (and morale) of the team, and it increases the time to mitigation (TTM) of every single issue.
Related to that is the lack of proper infrastructure. As soon as team members grab flash drives, share private Dropbox links, or use shared network drives to store compiled libraries or container images, there is a serious gap in the infrastructure that adds unnecessary friction—and waste. It’s also a potential security risk.
Unreliable software or infrastructure is a further source of friction, and therefore wasted resources. There are people who claim that many machine learning efforts fail, because data engineers spend too much time fixing brittle pipelines. As a rule of thumb, if D/ML teams spend more than 10% of their time on reactive tasks, they ought to improve the reliability of their platform. It is essential to familiarize yourself with the basics of DevOps and/or site reliability engineering (SRE), the kihon (基本) of operations if you will, to borrow a term from Japanese martial arts. Ignorance is no excuse, but neither is inaction. In the words of Yukio Mishima (Runaway Horses):
To know and not to act is not to know.
Any code or data that is not used in D/ML pipelines constitutes a waste. This includes pipelines that are disabled permanently. If these are to be deprecated, they must have a deletion date that has been agreed upon or at least communicated to all stakeholders ahead of time. Without such a final operational date, they may stick around forever, which is a potential drain on maintenance and support.
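Such a deprecation policy can be enforced mechanically rather than by memory. A minimal sketch, assuming pipeline metadata is available as records (the `Pipeline` class and its field names here are hypothetical, not any particular orchestrator’s API):

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Pipeline:
    name: str
    enabled: bool
    deletion_date: Optional[date] = None  # agreed final operational date


def flag_for_removal(pipelines: List[Pipeline], today: date) -> List[str]:
    """Return disabled pipelines whose agreed deletion date has passed,
    or that were disabled without any deletion date being communicated."""
    flagged = []
    for p in pipelines:
        if p.enabled:
            continue
        if p.deletion_date is None or p.deletion_date <= today:
            flagged.append(p.name)
    return flagged
```

Run as a scheduled audit, this surfaces the ‘zombie’ pipelines that would otherwise stick around forever.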
Software may not suffer from a short shelf life, but data definitely has a timeliness dimension. Data that lies around in storage without anyone using it is wasted and almost certainly incorrect. That is part of The (Unfortunately Realistic) Rules of Data Quality™ philosophy. Storage may be cheap, but that is no reason to generate data ‘just because’.
Any data in development and staging environments beyond what is actually needed is a waste of resources too. The same goes for code that is not used anywhere, as it increases the burden on maintenance. That does not mean any code or data that is about to be used but is not yet ready for prime time (i.e. in alpha or beta) is a waste. Such code serves a purpose.
Abandoned and partially done work is wasteful too, and it often points to context or priority switches on the DevOps team. This can become visible as, for instance, not-quite-completed D/ML pipelines, test suites with low or no coverage, hacky scripts, or half-implemented solutions.
Container registries and company-internal software repositories also count as inventory. While it’s essential to version artifacts and images, there is little use in keeping fossilized ones around.
Let’s not forget about backups! They may be needed for regulatory compliance, but unless you perform regular drills that prove you can actually successfully recover from backups, preferably in an automated fashion, backups can be a significant waste of storage space and the costs associated with it. Such disaster recovery exercises are comparable to katas (型) in martial arts, and, as in martial arts, crucial to full mastery. Note that production is not an acceptable safe space for such drills. A separate environment that is similar to the production environment is needed, not unlike a dōjō (道場) for practice.
Code that takes a while to load before it processes the first bytes is also a form of waste. It may be unavoidable (e.g. in a distributed data processing framework, where all the workers have to register with a master), but in the case of D/ML pipelines it can also indicate a mismatch between development and operations. Any D/ML pipeline that is executed, say, every fifteen minutes, and runs only for a few minutes at a time, may be better suited for streaming rather than batch processing, or its re-run interval may have been chosen too small. This may be due to misconfiguration or gradual changes in the requirements that have never been addressed properly.
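A simple heuristic can flag such mismatches automatically. The sketch below is my own rule of thumb, not an established formula; the thresholds are assumptions to be tuned per platform:

```python
def startup_fraction(startup_s: float, processing_s: float) -> float:
    """Fraction of each run spent on fixed start-up cost (e.g. workers
    registering with a master) rather than on useful work."""
    return startup_s / (startup_s + processing_s)


def streaming_candidate(interval_s: float, startup_s: float,
                        processing_s: float, threshold: float = 0.25) -> bool:
    """Flag a batch pipeline as a candidate for streaming (or a longer
    re-run interval) when it is idle most of the interval and its fixed
    start-up overhead dominates each run."""
    busy = (startup_s + processing_s) / interval_s
    return busy < 0.5 and startup_fraction(startup_s, processing_s) > threshold
```

A fifteen-minute schedule (900 s) with two minutes of start-up and two minutes of processing, for example, trips both conditions, whereas a job that works for most of its interval does not.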
Many orchestrators run workflows on a fixed schedule, typically with a fixed number of retries and possibly with exponential back-off to retry with larger and larger periods of inactivity in-between. Upstream dependencies are checked regularly even though they may not be available. Regular checks can be avoided by a trigger-based system: once an upstream becomes available a notification is sent to downstream dependencies. If such a downstream pipeline has received all the information about availability of upstreams, it can request resources from the scheduler or queue up. The hidden cost of polling can thus be avoided.
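The trigger-based idea can be sketched in a few lines; the class and method names here are illustrative, not taken from any real scheduler:

```python
class StubScheduler:
    """Minimal stand-in for a resource scheduler's submission queue."""

    def __init__(self):
        self.queue = []

    def enqueue(self, name: str) -> None:
        self.queue.append(name)


class TriggeredPipeline:
    """A downstream pipeline that, instead of polling its upstreams on a
    schedule, reacts to availability notifications and only requests
    resources once every upstream has reported in."""

    def __init__(self, name, upstreams, scheduler):
        self.name = name
        self.pending = set(upstreams)
        self.scheduler = scheduler

    def on_upstream_available(self, upstream: str) -> None:
        self.pending.discard(upstream)
        if not self.pending:  # all upstreams available: queue up just in time
            self.scheduler.enqueue(self.name)
```

No polling loop, no retries with back-off: the pipeline is dormant until the last upstream notification arrives.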
Slow start-up times of components critical to D/MLOps are another form of waste. Ideally there is a single D/MLOps dashboard with all vital statistics in a location that’s clearly visible to the team. If team members have to piece together the health of all constituents from multiple applications, it adds to the time it takes to discover and resolve issues. Such a dashboard ought to include the performance of various production ML models as a function of time too, to be able to assess drift, especially when using automatic model deployments. Just because the latest model has been deployed and its status is reported in a healthy green does not mean it’s performing well for end users.
It’s simple: there is no need to produce more data than is needed. If there is a requirement for an aggregated data set, do not provide a low-level one too. If data is only sensible after a certain period, for example after the entire experiment has finished, do not produce intermediate data sets. That way you also make it impossible for people to peek.
With a lack of coordination among autonomous teams in large organizations, it’s possible to produce the same data in slightly different formats multiple times. Or even worse: the same data but with different pipelines.
Data from monitoring and profiling can also be wasteful, especially when it is not used in alerts or if the alerts cause too many false positives. It’s very easy for a team to become desensitized to the flashing of a red icon when it happens too often. Unless the checking of data distributions is part of the team’s regular routine, it’s not going to add any value.
Design beyond requirements is another form of waste. ‘Just in case’ features can be the result of a lack of involvement from stakeholders; the development team had to make decisions based on their best guesses. Once it becomes clear these features are not needed, they should be axed, as they add to the operational overhead. It’s best to take a minimalistic (and agile) approach, for you cannot predict future requirements anyway.
Wabi-sabi (侘寂) is a concept from aesthetics that accepts imperfection and impermanence. This ties back to 2BIG. Do not overdesign today because of what you think tomorrow will be like, because tomorrow will likely have new and possibly conflicting requirements. It fits in neatly with the agile paradigm in that it’s better to have fast and early feedback rather than after a lot of upfront work that may be obviated by the time the customer sees it. Build small solutions and add more later, if need be.
Sometimes extra features are the result of poor architecture decisions. While team autonomy is great in a microservices architecture, as the organizational structure matches the systems’ communication boundaries, a lack of cross-team coordination often stands in the way of large-scale data integrity. Such coordination can be achieved by means of a data governance function, which ought to drive the standardization of technologies but most importantly data: nomenclature, schemas, schema evolution strategies, versioning, access policies, and so on. Note that sustained improvements in the quality of data require the commitment from the entire organization, particularly from its leadership.
Sharing knowledge is essential to continuously improve a team. As a leader, make sure you know your team’s skills, both overt and hidden, and let people use theirs to the fullest. This is not only good for the engagement of each individual, but it also cements trust between teams and leadership. One of a leader’s main tasks is to enable the team to excel, and what better way than to allow every person on the team to shine?
If one person has a knack for, say, stand-up comedy, encourage them to share the team’s work in a fun way with anyone who is interested. If another person enjoys writing, find opportunities for them to create engaging documents and posts that highlight the team’s efforts to spread awareness about data within the organization. Do you have someone who would like to become more competent in a different technology or component of the stack? Ensure they have the opportunity to work on projects that allow them to safely experiment and learn.
That does not mean you must pigeonhole people. For instance, a former teacher who switched careers may not want to become the designated person to create onboarding guides or teach courses. Similarly, the person who spent a few years in medical school before switching to software engineering should not be forced to become the first-aid responder. This is where it’s important for leadership to listen to the team and design roles around the talents of each individual. Organizations have remarkable flexibility in how work is distributed, if only they take the time to think beyond the confines of standard job descriptions.
Defects in the code are probably the most familiar kind of waste in D/ML engineering. This also includes bugs in tooling that the team may not have direct control over and that require workarounds.
Any waste that can be avoided by means of standardization is ‘muri’. For D/MLOps, this is anything that causes manual labour that can be minimized with an operations manual or automated, for instance with code generation for boilerplate code or infrastructure-as-code tools for environment configurations.
Service discovery is common for backends, but code, data, and model discovery also benefit from such an approach, as it ensures people have access to the information they need without having to ask around all the time: it promotes transparency by design. A company-wide registry with all data sets, including data access request forms, machine learning models, backend services, software libraries, and so on, can be a worthwhile investment. This registry must therefore be searchable and the main point of entry!
‘Mura’ is any kind of waste that is due to inventory that can be avoided by switching to a just-in-time (JIT) strategy. In D/MLOps this can be as simple as switching to ‘serverless’ environments: with managed services you typically pay for what you use, so you do not incur costs for idle machines. This of course needs to be weighed against the requirements of the business: if an on-premise data centre is a must, then a switch to the cloud is not an option.
Regardless of the infrastructure, there is mura inside most schedulers. Workload-agnostic schedulers (e.g. cron, Oozie, Styx) do not care what goes on inside programs or containers. They kick off (batch) workloads according to a fixed schedule. I already talked about how that can cause muda, but there is also a possibility for mura.
Schedulers like that do not monitor executions, which means they have no need for execution statistics or inferred capacity constraints. What if you could tell the scheduler you need the data at the latest by a certain time each day, and it figures out when best to run based on intrinsic constraints that may shift by hour or even day of the week? Such an optimizer would enable JIT data delivery: no jobs are executed until the necessary source data is available, the resources to run the job are sufficient, and the job can still be delivered on time without endangering jobs of higher priority. There is no need for servers to sit idle in a chain of dependent workflows; the scheduler is capable of redistributing the workload due to any delays of upstream sources. It’s essentially kanban (看板) for D/ML pipelines: it reduces the amount of work in process.
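The core of such a deadline-aware optimizer is easy to sketch. This is a simplified illustration under my own assumptions (a fixed safety margin, slack-based ordering), not how any production scheduler actually works:

```python
from datetime import datetime, timedelta


def latest_start(deadline: datetime, expected_runtime: timedelta,
                 safety_margin: timedelta = timedelta(minutes=15)) -> datetime:
    """Latest moment a job can be kicked off and still deliver on time."""
    return deadline - expected_runtime - safety_margin


def slack_order(jobs, now: datetime):
    """Order jobs by remaining slack (deadline pressure first), so that
    scarce resources go to the jobs that would otherwise be late.
    `jobs` is a list of (name, deadline, expected_runtime) triples."""
    def slack(job):
        _, deadline, runtime = job
        return latest_start(deadline, runtime) - now
    return sorted(jobs, key=slack)
```

A real optimizer would learn `expected_runtime` from execution statistics and shift it by hour or day of the week, but the principle remains: nothing runs earlier than it has to.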
The semiconductor industry employs such schedulers, which either use sophisticated optimization algorithms or heuristics based on extensive experience. The reason they use workload-aware algorithms is that the fabrication process is complex: there are typically hundreds of processing steps, each of which can take from a few minutes to many hours to complete; each step can be processed on any of a group of similar machines, but each choice may introduce long-range constraints based on what machine can be used in a future layer of the integrated circuit’s fabrication.
Now that you may have identified different kinds of waste in D/MLOps, how do you minimize waste? 5S is a technique to deal with waste by means of five steps:
- Seiri (整理): sift through the code and data and remove all unnecessary components, including any superfluous dependencies.
- Seiton (整頓): straighten out everything by stripping down to the essentials.
- Seisō (清掃): sweep regularly, that is, schedule it.
- Seiketsu (清潔): standardize the sift-straighten-sweep process by scripting it.
- Shitsuke (躾): sustain the effort by automation and/or reminders, and regular audits.
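The sift-straighten-sweep steps above can indeed be scripted, as seiketsu demands. A minimal sketch for a container registry or artifact repository, assuming artifacts are available as (name, version, build date) triples; the retention parameters are illustrative defaults:

```python
from collections import defaultdict
from datetime import date, timedelta


def sweep(artifacts, today, retention_days=90, keep_latest=3):
    """Seiri/seisō in code: keep the newest few versions of each artifact
    plus anything still within the retention window, and return the rest
    as candidates for removal. Run it on a schedule (shitsuke)."""
    by_name = defaultdict(list)
    for name, version, built in artifacts:
        by_name[name].append((built, version))
    cutoff = today - timedelta(days=retention_days)
    stale = []
    for name, versions in by_name.items():
        versions.sort(reverse=True)  # newest build first
        for rank, (built, version) in enumerate(versions):
            if rank >= keep_latest and built < cutoff:
                stale.append((name, version))
    return stale
```

Whether the flagged artifacts are deleted automatically or merely reported for review is a policy decision; the sweep itself stays the same.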
Some companies apply 5S to office workers too by instituting clean-desk policies. While I can understand the reasoning to make industrial espionage harder, all it amounts to in most instances is a lot of time wasted by moving belongings back and forth every day. Any benefits must be weighed against the loss of productivity. Moreover, clean-desk policies remind me of the quip:
If a cluttered desk is a sign of a cluttered mind, of what, then, is an empty desk a sign?
Let’s now turn our attention away from kaizen with its laser-like focus on eliminating waste, and instead look at more parallels between D/MLOps and lean manufacturing.
In manufacturing it is often beneficial to go to the shop floor, or genba, as issues may become clearer. I recall an issue with quality control on an assembly line that popped up somewhere between August and October every year. There was an annual pattern but it was not exact, and it only affected a single plant and only one assembly line. A lot of numbers were crunched, many hypotheses were tested, but no solution was proposed that made sense. It turned out that there was a window near the assembly line. It was open all summer because of the heat but closed when it got too cold. There was, however, a period in-between the summer heat and autumn chills, where it was warm but wet, so the window would stay open. During that time the humidity would seep into the assembly line nearest to the window, causing quality issues due to adhesives not sticking properly. The onset shifted every year because of the weather; that much was clear from analysing the data, but not so much how that would have an impact on the product quality. Without going to the assembly line and seeing for ourselves, the data would never have told us that.
In terms of D/MLOps, this does not mean you have to visit your data centre, but rather you can see how the data is used and in what context. It’s often insightful to see how a data set is misinterpreted or even tortured to answer basic questions that can be solved more easily.
For operations it can have a profound impact on the prioritization of pipelines. For instance, if users absolutely must have certain data available by a certain date or time (e.g. for the closing of a fiscal quarter), it may be necessary to tweak the scheduler or raise alerts earlier for upstream delays. Another example is when the deployment of machine learning models can have unintended side-effects, especially when there is a feedback loop, such as bad recommendations that cause fewer sales, which in turn causes fewer decent recommendations to be generated. Monitoring multiple systems and understanding how they are wired up is crucial to maintain the health of an entire suite of applications.
Whenever there are problems, all stakeholders ought to be informed immediately and automatically. As soon as it’s up to someone to type an email with details there will be an unnecessary delay that can cause further issues. While safety lights may be for dudes, according to Jillian Holtzmann in Ghostbusters, it’s usually best to err on the side of caution. Andon is such a system in manufacturing: a clear notification signal that informs of issues on the assembly line, or in our case: a D/ML pipeline.
Imagine there is a bug in some pipeline that produces data that is read by many other dependent pipelines. The bug is discovered after a partition of bad data has been produced. An email is sent notifying data consumers of the issue, so they have to backfill their data sets. In most automated scenarios, the buggy data will cause all downstream pipelines to produce their data sets, based on the wrong data. It’s possible downstream consumers can halt their pipelines, but they then have to remember to re-enable them afterwards. Time zones may also prevent people from reacting quickly enough to disable pipelines temporarily. There is also the risk that some consumers do not notice it, for instance because they are not paying attention to the correct channels or they missed the significance, because their pipelines are several times removed from the source of the issue. This means the pipelines are run twice, or worse: once but with the incorrect data.
That can be avoided if an orchestration layer is aware of issues that imply any downstream pipelines need to be halted until the issue has been resolved, after which they can be automatically restarted. That way, the data will be late, but at least not incorrect. In addition, it eliminates the waste that comes from producing bad data down the line.
Such an alert obviously needs to be visible to consumers too, so they understand why their data is not being produced. Automatic deletion of bad data could be configurable for each pipeline, which means it is both flagged externally and handled internally as soon as any issues have been noticed.
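An andon for pipelines boils down to propagating a halt signal through the dependency graph. A hedged sketch (the class and its methods are my own illustration, not an existing orchestrator’s API):

```python
from collections import defaultdict, deque


class Andon:
    """Andon for data pipelines: raising an issue on a pipeline halts every
    transitively downstream pipeline; resolving it releases them again so
    they can backfill automatically."""

    def __init__(self, edges):
        # edges: (producer, consumer) pairs in the pipeline dependency graph
        self.downstream = defaultdict(set)
        for producer, consumer in edges:
            self.downstream[producer].add(consumer)
        self.halted = set()

    def raise_issue(self, pipeline):
        """Halt the affected pipeline and everything downstream of it."""
        queue = deque([pipeline])
        while queue:
            p = queue.popleft()
            if p in self.halted:
                continue
            self.halted.add(p)
            queue.extend(self.downstream[p])

    def resolve(self):
        """Release all halted pipelines for an automatic restart."""
        released = sorted(self.halted)
        self.halted.clear()
        return released
```

Note that pipelines several hops removed from the source of the issue are halted without anyone needing to pay attention to the correct channels: the graph does the remembering.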
Lean Data and Machine Learning Operations (D/MLOps) is about the continuous improvement of operations by the systematic elimination of waste through the introduction and continual application of standardized processes. These principles have been inspired by similar ones used in manufacturing, where they have proven to be effective.
Waste in D/MLOps can arise from various sources:
- Architectural decisions (e.g. network transfer due to data being stored in physical locations different from where the processing happens);
- Design beyond requirements (e.g. code bloat, extra features, or overly complex models);
- Complex or insufficient infrastructure (e.g. multi-login access to critical systems, or no artifact repositories);
- Lack of coordination (e.g. duplication of code and data sets);
- Manual intervention (e.g. lack of automation, or no operations manual);
- Propagable bugs (e.g. no automated way to halt and restart downstream pipelines upon identification and resolution of data-related issues);
- Improper utilization of people’s talents.