The Challenges of Data Integration: Data Governance (Part 4/4)
Ian Hellström | 6 July 2014 | 8 min read
The fourth and final part of the series on the challenges of data integration is about data governance. Data governance is not so much a challenge as it is a critical component of continued success. Data integration is usually only the Band-Aid that is applied to a particular business problem. Beyond that it has the power to transform a business, but to do so you need to continuously guard, monitor, analyse, and improve your data and related business processes, so that the information you glean from it is always sound.
Data governance is an emerging discipline that pertains to business processes and dedicated roles related to all data assets, which demand formalized management practices throughout the enterprise as well as over the entire data life cycle in order to ensure the availability and integrity of said assets at all times. A somewhat simplistic definition is that data governance enables organizations to mature from dealing with data quality to driving the business with quality data.
No one goes from glitches galore to zero-fault tolerance overnight. To assess the level of data governance at your organization you can use any one of the many maturity models. A nice summary of some of the most common maturity models is given by NASCIO. Maturity models are nice but they do not tell you how to progress from one phase to the next. That’s what data governance frameworks are for. A popular framework is the one by the Data Governance Institute (DGI).
Before you can even think about data governance you have to establish a significant level of data literacy within the organization. There is no guarantee that a data integration project leads to increased data literacy. There are broadly speaking two classes of data integration projects, each with its own path to data literacy: 1) isolated instances driven by a single division or a few departments, and 2) cross-divisional or even enterprise-wide efforts to consolidate data sources and standardize data.
In the first case you may end up with a snazzy data warehouse, but if it’s used by only a few people, data literacy is unlikely to spread across organizational boundaries. The people who commission the work on the data warehouse are often the same people who already work with the source systems. They may well think the data warehouse is a huge leap forward, but without adoption by others the level of data literacy will remain low.
The second type of project is often the launch of data quality management (DQM) competencies and a substantial effort to standardize architectures, data models, and business processes and policies that deal with data and its uses. Data integration projects of this kind are frequently the result of a company’s exposure to data-related risks, which have made it (nearly) impossible to ignore data anarchy any longer.
The reason I mention data literacy is that at some organizations few employees are trained to work effectively with their own data; some companies are even surprised to find out over the course of data integration projects what data is really stored on their systems. Furthermore, without data literacy you have no chance of raising awareness about issues with data and related policies. You need a critical mass of data literati to be able to tackle data quality issues comprehensively and raise awareness at the executive level, which can then — and only then — decide on the strategic role of data governance within the organization.
There are certainly many facets to data governance, and I shall not touch upon each and every one of them. Instead I want to talk about two data integration topics that are related to data governance: accountability and standards.
Separation of Concerns
The separation of system responsibilities (maintenance and support) and business solution ownership (actual content) makes perfect sense from IT’s point of view: they deliver the infrastructure and that’s it. From a business perspective it is downright bollocks. It amounts to saying that a driver is responsible for the safety of the bus but not the passengers. Sure, if a nutcase happens to brandish a gun and shoots a fellow traveller, that’s not the bus driver’s fault. But accidents are most likely to happen on the road and not inside the bus, so the driver is indeed responsible for the safety of the passengers.
Admins are generally in the best position to judge whether certain rules make sense, as they (are supposed to) know the system inside out. Administrators on the business side are responsible for the entire business process they own, from data to decisions. So, why not ensure, or even enforce, that IT be actively involved in all data-related aspects of business-critical systems rather than absolved from anything that remotely has to do with the actual use of the systems they maintain? Engagement from IT of course presupposes an interest in what the business does with its data, which is by no means guaranteed, as I have had the following conversation time and again:
‘This data seems off.’
‘Do you happen to know why?’
‘No. I’m not responsible for the contents.’
‘OK. Do you know who is?’
‘No. I’m not responsible for knowing who is responsible for the contents.’
‘But you must know who the key users are?!’
‘No, I don’t care who uses the system.’
‘Even if no one uses the system?’
‘Shouldn’t you at least know who uses it, so you can shut down the system and save costs in case no one needs it?’
‘I don’t make the decisions round here, so no.’
‘Great… Thanks for this educational chat.’
Accountability is not only important during the data integration project but also afterwards. All your efforts will be for naught if no one is made responsible for the quality of the data stored in the source systems and the data warehouse. The data cleansing procedures and integration logic depend on the input data and when — not if — anything is modified (i.e. data structures and relations, but more importantly contents that affect the overall quality), the data warehouse will suffer. Data custodians and data stewards are the next step in the evolution of admins and business solution owners, respectively. Having clear roles and responsibilities dedicated to guarding your business’s data assets is the only sustainable solution.
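The point that cleansing logic silently degrades when source data changes can be made concrete with a defensive validation step at load time. The sketch below is hypothetical — the column names and rules are invented for illustration, not taken from any real system — but it shows the principle: reject a suspect batch loudly rather than let quietly corrupted data flow into the warehouse.

```python
# Hypothetical sketch: a defensive validation step in an ETL load.
# Column names and value rules are illustrative assumptions.

EXPECTED_COLUMNS = {"part_id", "facility", "process_step", "yield_pct"}

def validate_batch(rows):
    """Reject a batch loudly instead of silently loading suspect data."""
    errors = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if not (0.0 <= row["yield_pct"] <= 100.0):
            errors.append(f"row {i}: yield_pct out of range: {row['yield_pct']}")
    if errors:
        raise ValueError("batch rejected:\n" + "\n".join(errors))
    return rows

good = [{"part_id": "A1", "facility": "F1",
         "process_step": "etch", "yield_pct": 98.2}]
validate_batch(good)  # passes; a bad batch raises ValueError
```

A check like this does not replace a data steward, but it turns a silent quality regression into a visible failure that someone accountable has to act on.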
Apart from that, another important tool on the road to data governance is standards.
Sweet Merciful Standards
Few organizations have reached the stage where they have defined standards for data governance that enjoy full executive support. The vast majority are still struggling with their data, stuck in firefighting mode whenever problems with data quality arise. This means that data integration projects are likely to remain challenging in the years to come.
Some, however, have already started to think about data standards and even defined a couple. While you dance La Cucaracha with a celebratory Martini in your hand, please take a moment and ponder this: having standards and sticking to them are two completely different things. A standard can only be as good as its enforcer: if there is no enforcement, then a standard is mostly seen as a loose recommendation, and it’s up to the individual’s discretion to comply; self-regulation is extremely rare, especially at larger companies. What I am not advocating, though, is policing data sinners: you don’t want to become a bunch of bureaucrats.
I observed one semiconductor manufacturer let its formerly high standards gradually slip because of its increasingly lax attitude, a consequence of its policy to (almost) never fire anyone, even for gross incompetence or negligence. When eventually there was a huge production problem leading to a recall of several hundred thousand safety-critical components, it took several weeks to establish a task force and gather all relevant data, at which point no one could be sure that the data they were looking at was actually correct; a confidence score would not have helped much, since no one on the task force really understood the extent of their data problems, and even if someone did, the others would not have believed that person. For days executives raged up and down the corridors pointing to company standards that, as they were about to learn, no one had cared to enforce; compliance was practically non-existent. Fingers were pointed at people, mostly away from the owners of the respective hands, accompanied by muttered utterances à la “I was never told to do so” or “It’s not my responsibility”. This not-my-problem attitude had been manifest in the data integration project too, which proved to be part of a larger pattern.
The same company had another problem with standards: one central production system was used for control and execution at each facility on site, but each facility had decided on its own way of using it, even though there was a standard that described how the system was supposed to be used. The people who created the standard were the admins, and they had no interest in checking compliance with the standard, as they considered it their duty to keep the system running, not to make sure that the data it contained was correct or usable; they had only defined the standard because they were asked to. Anyway, as we incorporated the data correction maps into the ETL logic, we noticed that it was virtually impossible to account for all possibilities and at the same time make it future-proof. The administration of process steps, a critical piece of information to the integration, varied from engineer to engineer: some enjoyed the technical aspects so much that they went into great detail, while others abhorred the notion of having to type something into a computer, so they copy-pasted everything from ancient templates, leading to name collisions and the like. At some point we had actually managed to develop a robust solution, but an influx of new engineers meant we had to add some corrections to our corrections. In summary, a common system does not imply the same use, which in turn does not imply that the same data structures have the same meaning.
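The correction-map problem described above can be sketched in a few lines. The step names and canonical forms below are invented for illustration; the design point is that unknown variants should be surfaced as errors rather than guessed at, so that a new engineer’s copy-pasted template shows up immediately instead of as bad data months later.

```python
# Hypothetical sketch of a correction map for free-form process-step names.
# The raw names and canonical forms are invented for illustration.

STEP_CORRECTIONS = {
    "etch step 1": "ETCH-01",
    "Etch-1": "ETCH-01",
    "litho (copy)": "LITHO-01",
}

def normalize_step(raw, corrections=STEP_CORRECTIONS):
    """Map a raw step name to its canonical form.

    Unknown names raise instead of passing through, so new
    variants become visible errors rather than silent bad data.
    """
    key = raw.strip()
    if key in corrections:
        return corrections[key]
    raise KeyError(f"unmapped process step: {raw!r}")

normalize_step("Etch-1")  # → "ETCH-01"
```

The trade-off is that the map must be maintained — exactly the “corrections to our corrections” treadmill mentioned above — which is why such a map is a stopgap, not a substitute for enforcing the standard at the point of entry.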
As I have tried to explicate, data integration is not only a technical exercise. Aspects such as project management, people, and data governance are critical factors in the continued success of any data integration effort.
Data governance is the institutionalized programme that deals with all facets of data over the entire life cycle: a company’s data governance office is your organization’s Ministry of Data.
The challenges specific to data governance can be summarized as two negated implications:
- System responsibility ⇏ content responsibility.
- Standards ⇏ Compliance.
A corollary is that
- a common system ⇏ identical data structures (i.e. denotation ≠ connotation).
Some challenges may be more apparent than others in your particular project. What always stays the same, though, is that you have to be aware of all factors that affect the data integration project and its continuing success. These include technical aspects, project management, people, and data governance. Focusing on merely one or two of these facets is not enough: by doing so, you tacitly accept more risk than you have to, or as Sun Tzu famously said in The Art of War:
If ignorant both of your enemy and yourself, you are certain to be in peril.