The Challenges of Data Integration: Technical (Part 1/4)

Data integration is a formidable challenge. For one, data integration is never the goal of an organization, and a data warehouse is never the objective in itself; it is merely a vehicle that can drive you to your destination. Data storage and integration for data’s sake are a waste of time, money, resources, and nerves. Without a clear business case, effective leadership, and strong support, including but not limited to a highly visible and respected sponsor, any data integration project is doomed from the get-go.

There are of course many more facets to data integration that make it seem Sisyphean. Based on personal experience as both a project manager and a developer on data integration and data warehousing projects, I shall describe what I consider the main challenges. Many people have compiled similar lists: Peter Giblett, Loraine Lawson, Philip Russom, Pearl Zhu, and even the US Federal Highway Administration, to name but a few. People often focus on the technical aspects when talking about the difficulties of data integration.

The challenges I have recorded are grouped into four categories: technical, project management, people, and governance, for data integration is more than a mere technical problem for which a purely technical solution can be found. I shall talk about each category in a separate instalment of a biweekly miniseries on the challenges of data integration.

First off: technical challenges.

Architecture

The technical side of data integration projects is probably the most obvious one. The architecture of the data warehouse is largely determined by its business requirements and the (technical) context, which includes not only functional requirements, hardware, and software, but also regulations and standards specific to your industry: Do you need full semantic integration of all sources, or is a domain-specific dataspace or even a data federation sufficient to satisfy your business needs? Is there an applicable reference architecture available for your industry? Do you have to store all data with a complete audit trail for a certain amount of time? What backup strategy is to be used, and with which frequency? Is there to be an SLA? And so on.

A (near) real-time data warehouse has completely different requirements than a data warehouse used to store data from legacy systems. For instance, a real-time data warehouse needs logic to deal with latency (in the data extraction, transfer, transformation, and loading) and provisions for when a connection to a source is lost (data recovery), or the data warehouse itself is unavailable due to maintenance or simply unresponsive (e.g. asynchronous message queues). Such systems are likely to use event stream processing and run in memory. A data warehouse used to store data from legacy systems, on the other hand, is probably loaded on an ad hoc basis: whenever the data is ready, someone presses a button.
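
As a sketch of what such provisions might look like, consider the following Python snippet, which buffers a batch locally and retries with exponential backoff when the warehouse is unreachable. The warehouse_write stub and the in-memory buffer are invented stand-ins for a real loader and a persistent message queue.

    import random
    import time
    from collections import deque

    # In-memory stand-in for a persistent, asynchronous message queue.
    buffer = deque()

    def warehouse_write(batch):
        # Stub that simulates an occasionally unreachable data warehouse.
        if random.random() < 0.3:
            raise ConnectionError("warehouse unavailable")
        print(f"loaded {len(batch)} records")

    def load_batch(batch, max_retries=5):
        """Try to load a batch; back off on failure, buffer it if all retries fail."""
        for attempt in range(max_retries):
            try:
                warehouse_write(batch)
                return True
            except ConnectionError:
                time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8, 16 seconds
        buffer.append(batch)              # keep the batch for a later run instead of losing it
        return False

    load_batch([{"id": 1, "value": 42}, {"id": 2, "value": 7}])

A production setup would of course persist the buffer and handle ordering and duplicates, but the shape of the problem is the same.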

Additional constraints that have to be factored into the design of the data warehouse are applications, reports, interoperability, security, and permissions: Are certain applications expected to communicate with the data warehouse? Which applications and reports need to be adapted? Does the data warehouse feed information to other databases? Is everyone within the organization allowed to access the data from anywhere? How many concurrent users are expected?

Relational database management systems (DB2, Oracle, SQL Server, Teradata, …) may not provide the scalability that NoSQL/NewSQL databases can offer. The question is whether you are willing and able to sacrifice the flexibility and ease of data exploration that SQL-based data warehouses provide. Solutions that adopt the NoSQL/NewSQL paradigm quite often require people to have extensive knowledge of programming languages, which for business users can be a definite showstopper. Similarly, when you opt for a monolithic data warehouse, an important question related to the physical location of the database is how much data is expected to be transferred over the network, given the requirements of the business applications that run on the data warehouse. You may not be able to obtain a database server close to your sources because of infrastructure restrictions imposed by the organization. In that case you need to account for latency in the design of your applications.
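
To make the latency point concrete, here is a sketch that contrasts one query per key with a single set-based query; an in-memory SQLite database stands in for the remote warehouse, and the table is invented. Only the second approach keeps the number of network round trips independent of the number of keys.

    import sqlite3

    # In-memory SQLite database as a stand-in for a remote data warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [(1, 10.0), (2, 20.0), (2, 5.0), (3, 7.5)])

    customer_ids = [1, 2, 3]

    # Chatty: one round trip per customer; latency adds up over a slow network.
    totals = {cid: conn.execute("SELECT SUM(amount) FROM sales WHERE customer_id = ?",
                                (cid,)).fetchone()[0]
              for cid in customer_ids}

    # Set-based: a single round trip returns all totals at once.
    placeholders = ",".join("?" * len(customer_ids))
    totals = dict(conn.execute(
        f"SELECT customer_id, SUM(amount) FROM sales "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        customer_ids).fetchall())

The same reasoning applies to loading: set-based operations and bulk loads beat row-by-row processing when the database sits on the other end of a slow link.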

In itself, designing the architecture is neither a challenge unique to data integration nor one that a team of information technology experts cannot meet; there may be constraints and/or requirements specific to the situation at hand, but nothing that your experts cannot solve by themselves based on the requirements.

No [wo]man is an island, and that is true of technology experts too. One area where pure technical prowess is a necessary but not a sufficient condition for success is data quality.

Data Quality (or the Lack Thereof)

When developing databases from scratch, you typically build the data model in three steps:

  1. Conceptual data model: an abstract representation of entities and their relations.
  2. Logical data model: the conceptual model with attributes and primary/foreign keys.
  3. Physical data model: all entities are translated into objects, i.e. tables with column names and data types (a toy example follows this list).
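
As a toy illustration of the final step, the snippet below turns a two-entity logical model (customers and their orders, both invented for the example) into a physical model: concrete tables with column names, data types, and keys. SQLite is used only because it requires no setup.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Physical model: entities become tables with explicit data types and keys.
        CREATE TABLE customer (
            customer_id   INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            registered_on DATE NOT NULL
        );
        CREATE TABLE customer_order (
            order_id     INTEGER PRIMARY KEY,
            customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
            order_date   DATE NOT NULL,
            total_amount NUMERIC(10, 2)
        );
    """)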

To be successful at data integration you have to take two additional steps:

  1. Assess the data quality and cleanse the data where necessary.
  2. Merge sources, i.e. deal with incompatible data formats, enrich data where appropriate, and, more importantly, map all data sources to the data warehouse (a minimal mapping sketch follows this list).
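
A source-to-target mapping does not have to start life in a sophisticated tool; even a plain data structure forces you to be explicit about where each warehouse column comes from and how it is transformed. All table and column names below are invented.

    # Minimal source-to-target mapping: for each target column, record the source
    # column and the transformation that has to be applied. Names are invented.
    mapping = {
        "dw.customer.customer_id":   {"source": "crm.clients.client_no",    "transform": None},
        "dw.customer.name":          {"source": "crm.clients.full_name",    "transform": "trim and title-case"},
        "dw.customer.registered_on": {"source": "billing.accounts.created", "transform": "epoch seconds to DATE"},
    }

    for target, spec in mapping.items():
        print(f"{spec['source']:26} -> {target:28} ({spec['transform'] or 'as is'})")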

Once you have a detailed catalogue of all data sources, objects, relationships, data types, and descriptions of each item, you can take a look at the data itself, that is, its content and, by extension, its quality. To assess the quality of data, you need at least three pieces of information: 1) what the data is used for, 2) what the data is supposed to represent, and 3) what the data really represents. Before you proceed it is important to establish which data you actually need. Data in relational databases is often redundant, so you seldom need each and every table or column. Once you know you need a particular piece of data, you move on to the next level: find out what the data is supposed to represent and what it actually represents. The reason I wrote ‘is supposed to’ rather than ‘represents’ is that an attribute or table often seems straightforward enough, and sometimes you find documentation that seems applicable, but the people who use the data on a day-to-day basis know ‘better’. If there is no ambiguity with regard to the use and contents of source objects, you’re in luck; you can immediately finish the data integration document for those objects. Quite frequently that is not the case though.
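
A minimal version of the ‘supposed to represent’ versus ‘actually represents’ check is to compare the values the documentation allows with the values that actually occur; the table, column, and documented domain below are made up for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (id INTEGER, status TEXT)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)",
                     [(1, "OPEN"), (2, "CLOSED"), (3, None), (4, "archived ")])

    documented = {"OPEN", "CLOSED"}   # what the documentation claims the column contains

    actual = {row[0] for row in conn.execute("SELECT DISTINCT status FROM requests")}
    unexpected = actual - documented - {None}

    if unexpected:
        print(f"status contains undocumented values: {unexpected}")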

Suppose, for instance, that there is a flag in a table of one of the source systems that, according to the documentation you discovered, indicates whether or not a particular entry is ‘active’ and thus considered in subsequent analyses by the system. When you check the data you see that it indeed has two distinct values: Y and N. Perhaps an occasional NULL, but nothing that seems out of the ordinary. You shout, “Woohoo!” and continue with the data integration. Well, woohoo nothing! Several months later a client or business associate complains to you that the ‘verified’ flag is absent from the data warehouse, upon which you kindly point out that there has never been one, to which your client will answer, “Sure, there is! It’s called IS_ACTIVE.” At that point you scratch your head, re-read the documentation, and reply that that flag is not supposed to be used for that purpose.

So, who’s right? The customer is always right. It is quite possible that the people who populate that particular field repurposed it because they desperately needed the ‘verified’ flag and had no use for an ‘active’ flag. Since there was already a field with the correct data type available, they decided it was a good idea to (ab)use it.

Sometimes the data type matches the original and/or actual use, as in the case of our ‘active’ flag. There are also cases where the data type is completely off. For instance, imagine you find that possible values for the IS_ACTIVE column are Y, 12.7, ===ERROR===, and cake. Now what?! In these and related cases data profiling can be a worthwhile investment. Data profiling is often dismissed as unnecessary because it adds overhead to the project and can lead to analysis paralysis, but having an automated system that monitors all (candidate) source data and metadata relevant to the data warehouse allows you to focus on what’s really important, namely the data integration itself. Data profilers typically show statistics — point estimates (mean, median, standard deviation, …), interval estimates (confidence/credible intervals), histograms, and so on — and fancy charts that highlight potential issues and compliance with predefined data standards. A data profiler can also be used to identify redundancies, relationships, and patterns within the data.
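
You do not need a commercial profiler to get a first impression. Assuming the source can be read into a pandas DataFrame, a few lines already yield per-column counts, null ratios, distinct values, and the most frequent values; the sample data reuses the odd IS_ACTIVE values from above.

    import pandas as pd

    def profile(df: pd.DataFrame) -> pd.DataFrame:
        """Per-column mini-profile: row count, null ratio, distinct values, top values."""
        rows = []
        for col in df.columns:
            s = df[col]
            rows.append({
                "column":     col,
                "dtype":      str(s.dtype),
                "rows":       len(s),
                "null_ratio": round(float(s.isna().mean()), 3),
                "distinct":   s.nunique(dropna=True),
                "top_values": s.value_counts(dropna=True).head(3).to_dict(),
            })
        return pd.DataFrame(rows)

    # Sample data with the odd IS_ACTIVE values from above (invented, of course).
    df = pd.DataFrame({"IS_ACTIVE": ["Y", "N", "Y", "12.7", "===ERROR===", "cake", None]})
    print(profile(df))

Storing such profiles per table and per run is what turns a one-off check into the continuous monitoring described below.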

Data integration is not a one-off task, particularly when the integration runs continuously or at least periodically. There will always be bug fixes, change requests, improvements, enhancements, and the like; data sources also evolve. The same applies to data profiling. Throughout the life of each system, including the data warehouse, it is vital to keep a keen eye on all metrics pertaining to the data and its metadata. In the end, what really matters is the difference between having data and having good data.

In fact, according to Gartner, one-third of all Fortune 100 companies will soon slide into information crises because they cannot rely on the quality of the data contained in their decision-support systems. Predictions by Gartner’s industry analysts often miss the mark but this one is, based on my experience, likely to be spot on. Whether the majority of the unlucky 33 Fortune 100 companies will actually identify the crisis as such is another topic altogether.

Data profilers are of course not a panacea. You still have to cleanse the data, which is by far the most time-consuming task. It’s also the single most important item on your already long to-do list as a data integrator. Profiling data before you embark on a data integration project is recommended, as it leaves you with a clearer picture of where problems with the data are likely to pop up. Because you know about it, you can plan for it, budget for it appropriately, and act accordingly. I have had numerous encounters with data experts and system administrators who warned about data quality issues but could not give concrete examples. Without examples that show the extent of data quality problems it is impossible to estimate the effort required to whip the data into shape. Similarly, vague statements such as “The majority of our data is accurate” are of no use to a data integrator. What is the majority: 50.1%, 80%, 99.999%? Even though you have to rely on the information provided by data experts, you should not take everything at face value; a data integrator should always have a few grains of salt at arm’s reach. Therefore, it’s always best to check and remove all doubt.
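
One way to turn ‘the majority of our data is accurate’ into a number is to run explicit validation rules and report pass rates; the sample data and the rules below are of course invented and would in reality come from the business.

    import pandas as pd

    # Invented sample data; in practice this would be an extract of a source table.
    df = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, None],
        "email":       ["a@x.com", "b@x.com", "not-an-email", "c@x.com", "d@x.com"],
        "amount":      [10.0, -5.0, 20.0, 7.5, 3.0],
    })

    # Example validation rules; the real rules must come from the business.
    rules = {
        "customer_id present": df["customer_id"].notna(),
        "email looks valid":   df["email"].str.contains("@", na=False),
        "amount non-negative": df["amount"] >= 0,
    }

    valid = pd.concat(rules, axis=1).all(axis=1)
    print(f"{valid.mean():.1%} of rows pass all rules")
    for name, mask in rules.items():
        print(f"  {name}: {mask.mean():.1%}")

Even rough rules like these replace ‘the majority’ with a number you can plan and budget against.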

All? Well, most of it anyway. Some data issues will only be discovered during the project, and they can jeopardize the schedule. You have to leave a bit of headroom for unforeseen quality-related efforts, typically a percentage of the total data cleansing efforts already estimated. My number-one rule when it comes to data quality is that it’s always worse than you or any knowledgeable person thinks.

Incompatible data formats can sometimes throw a spanner in the works too, especially in the case of unstructured data. Both structured and unstructured data should be carefully considered, even though generally only structured data ends up in data warehouses. Again, sifting through the data and profiling it is a good start, but without solid business knowledge there is little chance that you will understand the data well enough to succeed at data integration. It is not enough to know which data to extract and how to transform it; you need to fully comprehend the data and its value to the business, so that you can make correct decisions as to why you need to transform the data in a certain way. Once you know what data you need and why you need it, you can start thinking about alleviating data incompatibility issues. You need both the technical abilities to integrate data sources and the detailed business know-how to support it.
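
Dates are a mundane but ubiquitous example of incompatible formats. A sketch of normalizing them to a single representation might look as follows; the list of formats is an assumption and would in practice come out of profiling the sources.

    from datetime import date, datetime
    from typing import Optional

    # Formats assumed to occur in the sources; the real list comes from profiling.
    KNOWN_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y%m%d"]

    def normalize_date(raw: str) -> Optional[date]:
        """Map a date string in any known source format to a date, or None."""
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).date()
            except ValueError:
                continue
        return None   # unparseable values are flagged for cleansing, not silently fixed

    print([normalize_date(s) for s in ["2014-03-01", "01.03.2014", "3/1/2014", "20140301"]])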

Source Documentation

Since we’re on the topic of business knowledge, let’s talk about source documentation. Systems, proprietary or otherwise, are rarely designed to be integrated; that is not only a major technical challenge but also an informational one, and information is vital to the project. Documentation on data contents and internal logic is consequently scarce. Custom business logic is often completely undocumented, and the people who happen to have created some documentation have often moved on within the organization, quit, retired, or been fired. Documentation is also frequently in the form of user guides rather than developer manuals that describe where which data is stored.

That leads us to what is probably the second most time-consuming task: information acquisition. Knowledge is sparsely distributed, which means that you have to talk to many subject-matter experts, which in turn increases the risk that you obtain contradictory information.

Summary

Data integration projects are not only technical challenges. Non-technical aspects, such as project management and dealing with stakeholders, are demanding too. What makes data integration technically challenging can be summarized as follows:

  • The context, business requirements, and additional constraints determine the data warehouse architecture.
  • (Proprietary) systems have rarely been designed to be integrated.
  • Problems with data quality are sometimes only apparent during the project.
  • Data cleansing is time-consuming, thankless, but absolutely necessary.
  • Expertise is rarely concentrated: many people know bits and pieces, few people know a lot.
  • Knowledge is often only available ‘in heads’ and rarely documented; if documented, knowledge is often in unofficial documents.
  • Custom business logic is rarely documented or easily accessible.
  • Systems are dynamic: data integration is never done.