The Challenges of Data Integration: Project Management (Part 2/4)

In this second post of a four-part series on the challenges of data integration, I want to talk about project management. Data integration is, as I have said before, not simply a matter of throwing technical people at a business problem. Not literally, of course: most people do not like being flung at things, abstract or concrete, though probably at the latter a bit less than at the former.

Project management is the key to your success. Sure, you need able people to build the data warehouse, but without a solid foundation in project management your project will tip over at the slightest sigh. And please take it from someone who has been there, done that, got the T-shirt, and has since outgrown it: there will be a lot of sighs during the project, and even a few full-blown tornadoes… To weather any storm, you and the entire organization have to live by sound project management practices. Project management is not a silver bullet, but it can protect you against the most common enemies: no idea, no plan, no back-up plan, and no support.

Raison d’Être

A data integration project absolutely needs a clear business case. Data integration is never the desired outcome; it’s simply an enabler. It is not a vanity project to add a shiny line to the CV of a wannabe manager.

Suppose your company has discovered that it’s impossible to match reports from different departments because the data is too messy and not readily available; for instance, the data is stored (read: hidden) in personal spreadsheets. You may decide that data integration is the solution, but remember: it’s only the enabler to generate consistent reports. A data warehouse is a hammer and data is the nail. You use the hammer to drive nails into the wall in order to hang a picture. You neither buy the hammer to admire it nor display a box of nails for its artistic value. Well, at least normal people don’t.

Why is it so important to have a reason for doing the data integration? Two reasons really: to get the project accepted (i.e. sell it) and to get acceptance for the project (i.e. buy-in).

Sensible organizations require you to do capital budgeting or at least a cost-benefit analysis before a decision is made on the proposed data integration project. If you have a clear business case you can estimate the benefits from the data integration, whatever they may be. Without one you only have indirect benefits: higher productivity thanks to a single, consistent source, increased transparency within the organization, and so on. But how do you put a monetary value on benefits as vague as these? That’s exactly the point: you don’t.

It obviously depends on the standards within your organization, but the metrics I consider essential to calculate before deciding whether or not to proceed with the data integration project are the net present value (NPV), the discounted payback period (DPP), and the internal rate of return (IRR). For all three you have to list the initial investment and the expected cash flows for the entire duration of the project, which includes the time after the actual data integration when you expect to reap the benefits from it, that is, the full life cycle of the data warehouse.

The NPV tells you whether or not your project is expected to add value to the company. If your project has a positive NPV, then you’ve cleared the first hurdle. What it does not tell you is when you can expect to get your money back, which is what the DPP is for. As a rule of thumb, the DPP should lie within the first half of the project, otherwise the risk of not reaching the break-even point (in time) may be too high, although that of course depends on what your organization considers an acceptable risk for the given NPV. The IRR is the annualized effective compound rate of return, which is a fancy way of saying that it is the rate of growth your investment is expected to generate. Its value should ideally be higher than your company’s weighted average cost of capital (WACC). Perhaps your organization has a predefined minimum rate of return; check with the controllers to make sure.

The return on investment (ROI) is a simple metric you can calculate on the back of an envelope, but it does not replace the aforementioned NPV and DPP. It is simply the percentage you expect to gain or lose on the investment in the project. If you discover that the ROI is negative, you can probably stop right there.
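To make these metrics concrete, below is a minimal sketch in Python. The cash flows and the 10% discount rate are made-up illustration values, and the bisection-based IRR assumes a conventional project (one initial outlay followed by net inflows); it is a sketch, not a substitute for your controllers’ models.

```python
# Minimal capital-budgeting metrics; all numbers are illustrative.

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the initial investment
    (negative), cash_flows[t] the net cash flow in year t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def discounted_payback_period(rate, cash_flows):
    """First year in which the cumulative discounted cash flow turns
    positive; None if the project never breaks even."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf / (1 + rate) ** t
        if cumulative >= 0:
            return t
    return None

def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-6):
    """Internal rate of return by bisection: the rate at which the NPV
    is zero. Assumes a single sign change in the NPV curve."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cash_flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def roi(cash_flows):
    """Simple (undiscounted) return on investment."""
    invested = -cash_flows[0]
    gained = sum(cash_flows[1:])
    return (gained - invested) / invested

# Hypothetical project: 500k up front, then five years of benefits.
flows = [-500_000, 120_000, 160_000, 180_000, 180_000, 160_000]
rate = 0.10  # assumed discount rate (your WACC may differ)

print(f"NPV: {npv(rate, flows):,.0f}")
print(f"DPP: year {discounted_payback_period(rate, flows)}")
print(f"IRR: {irr(flows):.1%}")
print(f"ROI: {roi(flows):.1%}")
```

With these made-up numbers the NPV is positive (about 99,000) and the IRR of roughly 17% clears the assumed 10% WACC, but the discounted break-even only arrives in year five, which by the rule of thumb above would already be a warning sign.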

By the way, do you see the not-so-teensy-weensy snag when you only focus on the data integration tasks? You only incur costs (servers, software licences, maintenance and support, personnel, etc.) and generate no cash for the company’s piggy bank, assuming you are not replacing source systems in the process, which would perhaps be more aptly labelled a data migration project. Hence, your envisioned data integration project has no (monetary) value! The benefits have to come from what the integration enables, which is exactly why you need the business case.

Once you have received the green light based on the value of the data integration project to the organization, you have a powerful tool in your hands to convince sceptics that the project is in fact beneficial to the organization. The buy-in of all stakeholders is essential to the success of the project, and the most sceptical people are often the ones whose support and cooperation you need the most. A recent article in Harvard Business Review confirms this view. Note that I am not talking about chronic whiners who continuously manage to slip through the cracks though.

Support

Knowing the true value of the project and its relation and relevance to other projects is very important in discussions with people. Equally important is the support of the entire organization, including a visible and respected sponsor. Both attributes are crucial: a highly visible sponsor without the respect of the troops will not be listened to, and a well-respected sponsor who has little or no influence is of no use either. Prior to the kick-off, the sponsor often takes the lead in overcoming resistance to the project.

One of the most important tasks that an executive sponsor has with regard to the project team is to provide resources. Timely access to data sources, information, and subject-matter experts is vital to keep the project on track. The project manager must escalate issues regarding resources to the sponsor who in turn can mediate and resolve them immediately. The difficulty is in knowing when to escalate: too early and you look like a crybaby who cannot manage the project effectively, but too late and you endanger the project itself. I often find it easier to ask myself if not escalating is a viable option. Ask yourself: If I don’t escalate right now, will we a) move onto the critical path, or, if we are already on the critical path, b) delay the delivery of the milestone or whole project? If the answer to either is yes, then you should escalate immediately.
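For what it’s worth, the heuristic can be written down as a tiny decision function. This is just the two questions from the previous paragraph encoded in Python; the function and its inputs are illustrative, not part of any real project management tool.

```python
# The escalation heuristic from the text, encoded as a toy function.
# All inputs are the project manager's own judgment calls.

def should_escalate(on_critical_path: bool,
                    would_join_critical_path: bool,
                    would_delay_milestone_or_project: bool) -> bool:
    """Escalate if not escalating would move the work onto the critical
    path, or, if it is already on the critical path, would delay a
    milestone or the whole project."""
    if on_critical_path:
        return would_delay_milestone_or_project
    return would_join_critical_path
```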

The moment you are about to get into the nitty-gritty of the data, you need the full backing of business experts who are able to explain its contents. Your organization has to have a culture that encourages communication and endorses complete transparency. As aptly phrased by Daniel Teachey from DataFlux:

[T]he integrity of a company’s data hinges on the integrity of its practices for collecting and managing data. Enabling an environment conducive to respect and cooperation can go a long way in effecting quality data that drives quality business decisions.

As the manager of a data integration project you are as much contained in as constrained by the environment of the organization. You cannot lead a data integration effort and change the culture of the company at the same time. Inert companies that are unwilling to share information beyond arbitrary lines drawn on an organization chart are not going to open up like a piñata just because your socks roll up and down when you think about your data warehouse.

I have mentioned them a few times now but I have never gone into specifics: stakeholders. Stakeholders are a complicated lot. Data integration projects tend to have many disparate groups of stakeholders who normally do not talk that much with one another: IT (administrators, architects, data integrators/profilers, developers, and testers), business people (analysts, subject-matter experts, and end users), and management (sponsor and end users). On the one hand, you have to include enough people to make sure your project has all the support it needs. It’s also important to make sure no one feels excluded, as excluded people may start sabotaging the project through backchannels; it does happen, unfortunately. On the other hand, don’t cast your net too wide or you’ll catch a lot of small fry with big mouths. ‘The more the merrier’ is not a phrase you should keep in your vocabulary when it comes to data integration projects. More people mean more opinions, and more opinions mean more to take into account when making decisions. Constructive criticism and communication are paramount, but too many cooks spoil the broth: frustration is a common gripe that leads to little or no progress being made, which causes a vicious downward spiral to a full standstill.

Be Prepared

A project has to hit four prime targets to be successful: time, cost, scope, and performance requirements. A project that is late, over budget, and slimmed down in scope just to meet the original performance requirements can hardly be called a success, and it has probably been a nightmare for everyone involved, too.

The preparation phase is where the wheat is separated from the chaff. Stumbling haphazardly into a project is never a good idea, but for data integration there can be severe penalties due to unforeseen data quality issues that require substantial data scrubbing, which by its very nature progresses as slowly as molasses. Solving problems of lower-than-desired data quality is a ‘known unknown’: we know that data cleansing affects the schedule significantly, but we do not know the magnitude of the work required. Therefore, profile the data before the project starts, as it allows you to estimate the effort more realistically.
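As an illustration, here is a minimal profiling sketch in Python with pandas. The file customers.csv and the customer_id column are hypothetical; a dedicated profiling tool goes much further, but even these basic per-column statistics give you a first feel for how much scrubbing lies ahead.

```python
# Minimal, illustrative data profile of a hypothetical source extract.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source extract

# Basic per-column statistics: type, completeness, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
})
print(profile)

# Duplicate keys are a classic integration headache: count them early.
print("duplicate customer_ids:", df["customer_id"].duplicated().sum())
```

Even such a crude profile immediately surfaces near-empty columns, unexpected cardinalities, and duplicated keys, all of which feed straight into the effort estimate.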

The adage that it takes money to make money transfers directly to time, since time is money: it takes time to save time. The time invested in data profiling is well worth the money spent on it. That is not carte blanche to go bonkers though: you cannot and need not know everything about the data when you make your project plan. All you need is a reasonable estimate of the size of the tasks ahead.

What may come as a surprise is that the effort to cleanse data scales with the size of the data sets. The relationship is sometimes highly non-linear, but there is definitely a positive correlation between the two, even when you correct for the additional time it takes to process more data. The reason is simple: the more data, the more opportunities to make a hash of it.

Furthermore, you should begin with the end in mind. Establish criteria for acceptance of the project at the outset. Define milestones, as usual, and agree upon specific, measurable, realistic sign-off criteria. Once the project is under way you have neither the time nor the power to define the criteria for acceptance without endangering the project. It’s plain sailing for data sceptics to oppose the outcome of the project simply because it’s not clear what constitutes success; you cannot prove success when you neither know what you have to demonstrate nor how to do it.

All in all, be sure to plan for at least the following tasks:

  • Data profiling
  • Detailed requirements analysis with subsequent specification of the business case
  • Criteria for acceptance of each milestone and the project itself
  • Full system tests
  • Certification of core components
  • Development of unit tests and quality assurance (QA)
  • User acceptance tests (UATs) with subject-matter experts
  • Extensive documentation for end users as well as integrators and administrators
  • End-user training

Most of these deal with software testing, but I have, unfortunately, seen quite a few data integration projects where software testing was ignored because it was considered an extravagance that had not been included in the schedule, for various horrendous reasons. Needless to say, the data quality was not up to par for a very long time, and regression errors were rife to the point of slowing down the deployment considerably. Why throw out the baby with the bath water? Do not abandon standard software practices: a data warehouse is a piece of software, and all software should be thoroughly tested before being rolled out.

Risk

Risk rears its ugly head in various guises when it comes to data integration projects. Any project has risk associated with it due to uncertainties. Somewhat surprisingly, many of the uncertainties are sort of predictable. The truly large-impact, hard-to-predict, rare events, also known as black swans (‘unknown unknowns’), are rarely the reason projects derail.

Tom Kendrick’s book Identifying and Managing Project Risk is an excellent resource on the topic of risk, and I heartily recommend it to anyone who is serious about project management. What his Project Experience Risk Information Library (PERIL) shows is that the biggest threats to projects, including IT projects, follow a pattern; a great many of these perils can in principle be mitigated or avoided completely. Scope risks account for half of the cases reported in PERIL; schedule and resource risks take up approximately one quarter each.

The root causes of project slippage, sorted in descending order of project impact, are:

  1. Changes made to the scope during the project (scope)
  2. Issues due to internal staffing (resource)
  3. Failure to meet the original requirements (scope)
  4. Delays within the purview of the project (schedule)
  5. Inadequate schedule estimates (schedule)
  6. Issues due to outsourcing (resource)
  7. Project dependencies external to the project (schedule)
  8. Insufficient financial resources (resource)

Let’s take a look at each risk category in more detail.

Scope Risks

The main risks in the category of scope are:

  1. Gaps, or legitimate changes to the scope during the project
  2. Scope creep: non-mandatory changes to the scope during the project
  3. Software defects that negatively affect the deliverables
  4. Hardware defects that negatively affect the deliverables
  5. Integration defects of either hardware or software that negatively affect the deliverables
  6. External dependencies that require the scope to be altered

Scope creep is the top candidate to look out for. If there is a clear need to adjust the scope based on information you receive late in the project, there is not much you can do about it; gaps can arise because of regulatory requirements, for instance. It is enticing to tag on additional requirements as the project goes on, but it is extremely demotivating for the team: imagine running a marathon where, each time you can see the finish line in the distance, someone moves it back a mile.

On several occasions during a data integration project I was in the unfortunate position of being overruled by the executive sponsor, who also happened to be my boss. As the project went full-steam ahead and stayed on track, the sponsor thought that additional features would make it an even bigger success, so it was decided, without my consent, to extend the project by tagging on more and more. In fact, that game was played twice by the sponsor and a handful of managers who scoffed at my arguments to postpone the additional requests. It was clear that the features could be added later without much ado, so there was no reason to take immediate action, but according to my superiors they absolutely had to be done right away. Naturally, the project slipped and I lost credibility as a project manager.

I generally recommend that project managers maintain a list of additional requests that can be implemented at a later time, in a future release. Managers often want it all and they want it now, but that may well not be in the interest of the project, and as a project manager your prime responsibility lies with the project. Once you have prioritized the items on the list, you can outline a road map for the data warehouse for when the data integration phase is complete. That not only shows your commitment to the project and its ongoing success, but also that you have a plan for afterwards. It may not always help, as in my case, but it’s best to be prepared for all eventualities.

The impact of software and hardware failures on data integration projects can be minimized or even eliminated by common sense: version-controlled code repositories, up-to-date back-ups of all relevant code and documentation, failover of critical systems (e.g. virtual machines on different physical servers), and basic security (e.g. recent updates, firewalls, antivirus software, and encryption for data transfers over networks). Scope changes due to hardware defects are, in my experience with data integration projects, rare.

With a decent test base, software defects mostly pop up in unit tests. Without one, these defects can send you the bill at the UATs or in production, which can be very costly indeed, as you have to revisit the original requirements, adjust them, redesign the test cases, write new code, and run all relevant tests again at a time when you thought you were nearly or already done. It is therefore essential to schedule enough time to analyse the requirements thoroughly and to ensure that the unit tests capture all the business cases that need to be examined.
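To give a flavour of what such a unit test looks like for a data integration rule, here is a minimal sketch in Python. The transformation normalise_country and its mapping are entirely hypothetical stand-ins for a real business rule from the requirements.

```python
# A toy business rule and its unit tests; both are hypothetical.
import unittest

def normalise_country(raw: str) -> str:
    """Map free-text country names from a source system onto ISO codes."""
    mapping = {"germany": "DE", "deutschland": "DE",
               "netherlands": "NL", "the netherlands": "NL", "holland": "NL"}
    return mapping.get(raw.strip().lower(), "UNKNOWN")

class TestNormaliseCountry(unittest.TestCase):
    def test_known_variants_map_to_same_code(self):
        self.assertEqual(normalise_country(" Deutschland "), "DE")
        self.assertEqual(normalise_country("Holland"), "NL")

    def test_unknown_values_are_flagged_not_guessed(self):
        # Business case: never silently invent a code for unknown input.
        self.assertEqual(normalise_country("Atlantis"), "UNKNOWN")

if __name__ == "__main__":
    unittest.main()
```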

Integration defects are likely under-represented in PERIL in the case of data integration projects due to the nature of the undertaking. These are problems that arise from the integration of and interaction with multiple systems. Data integration software can reliably access a multitude of sources, so it very rarely causes scope adjustments. What can be a nuisance is the communication between the data warehouse and other systems, especially legacy systems whose interfaces may no longer be compatible.

Schedule Risks

When it comes to the schedule, the top-10 threats to the project come from:

  1. Underestimating the learning curve (estimates)
  2. Waiting for deliverable components (delay)
  3. Waiting for specification and/or data (delay)
  4. Project interdependencies (dependency)
  5. Poor estimates and/or inadequate analysis (estimates)
  6. Waiting for equipment (delay)
  7. Waiting for infrastructure to be ready and/or support (dependency)
  8. Unrealistic top-down imposed deadlines (estimates)
  9. Shifts in legal requirements, regulations, and/or standards (dependency)
  10. Untimely decisions for escalation and/or approval (delay)

As you can see, these schedule risks appear in three shapes: estimates, delay, and dependency, all of which are approximately equally represented. Personally, I don’t think that the learning curve is a major risk for data integration projects, provided your team has the know-how and experience.

What does significantly delay data integration projects is waiting for information (support), data, and the infrastructure to be in place. Here the backing of an executive sponsor can be essential, especially when the organization is dragging its feet. Any delay in getting the data warehouse and the development tools up and running, in granting access to all developers, or in obtaining the data sources and their documentation will impact the schedule.

At one company it took two whole months after the data warehouse was technically ready before all developers had received the proper credentials to access it. In fact, it happened on three separate occasions there (once with the data warehouse and twice with application servers)… Save yourself some trouble and ask around for typical time frames for commissioning standard services, such as databases and application servers, and incorporate these figures into your estimates. If you’re in a hurry, you can use the dead time to talk to people about their data.

What is more, commitments should always be based on a realistic plan, never on promises already made to customers: the latter is a guarantee for failure!

Resource Risks

The last but certainly not least group of risks is that of resources. There are three subgroups within resource risks: people, contracting, and finances:

  1. Loss of permanent staff members (people)
  2. Late deliverables and/or poor output from contractors (contracting)
  3. Insufficient funds (finances)
  4. Loss of temporary staff members (people)
  5. Delays due to bottlenecks (people)
  6. Late availability of staff (people)
  7. Loss of cohesion and interest (people)
  8. Contracting-related delays (contracting)

Most of the risks are self-explanatory, but one that I would like to highlight is delays due to bottlenecks, in particular what I call the bus factor: the number of people who have to be hit by a bus for the project to be in danger. No data integration project should have a bus factor of one or less. That is especially true of the core data integration team: all data integration activities are likely to be on the critical path, so any hold-up in these activities will delay the entire endeavour. If you only have one data integrator capable of doing the work, you may have severe problems when he or she is run over by a bus or, less drastically, in bed with the flu.

The status of some of the work involved in data integration is hard to measure, for instance source data analysis and data scrubbing. Contractors, and regular employees too, can happily report that everything is fine and dandy right up to the moment a milestone is about to be missed. Regular (weekly) meetings or conference calls are definitely necessary, but they are not sufficient to guarantee that headway is being made.

Adding manpower can seem useful when the output of a contractor becomes unsustainably low, but you are usually saddled with more costs and very little to show for it, as detailed in Frederick Brooks’s book The Mythical Man-Month. You do not want to fall into the consultancy trap.

Monitor and Document the Project

It is the duty of every project manager to monitor the project. Without monitoring you have no control over the situation and thus little chance of steering clear of obstacles.

Whether an Excel worksheet is enough for you or you need project management software depends on the scope of the project. Personally, I’d go for a solution with full project planning capabilities, such as Microsoft Project, and a separate system for the actual software development. Some solutions, such as JIRA or Azure DevOps Server (a full application life-cycle management suite), combine multiple features, including basic project management tasks, in one package. If you’re in a project organization you may even have an enterprise solution such as AtTask or Primavera. In general, though, you simply pick whatever software is customary in your organization. If your organization has no standard project management software, you can always fall back on Excel or choose a free, open-source solution like ProjectLibre. Remember: you don’t have to re-invent the wheel, and it’s not an arms race, so take only what you need. No more, no less.

I pause to emphasize that the tools you choose to monitor the project should not be devoted exclusively to software development. Yes, the software is the major component, but data integration is not successful merely because the software functions as specified: it has to be adopted by the clients too. A data integration project’s focus is generally on the technical aspects of the work, but the project should not be overwhelmed by them.

You’ll save yourself a lot of time if you create standard reports on top of your project data, so that you can easily publish and distribute them. I advise you to have three versions: a detailed one for the team, one with budgets for the management, and another one without budgets for the organization and possibly contractors. Not every stakeholder or team member needs to be privy to the project’s financials.

Monitoring is fine but you should also document the project: future projects depend on it. The project documentation allows you to revisit points in time and it’s a valuable reference to better plan future projects of the same kind. It’s also a track record of you as a project manager.

Moreover, one of the most indispensable things for a manager of a data integration project is a decision log. It is usually no more than a spreadsheet with all the information pertinent to each decision: what has been decided, by whom, and when; related decisions and/or topics (for instance, whether the decision invalidates one made previously); the effort and cost required (if any); and references to more details, such as a link to an email or the identifier of the minutes. The reason a decision log is such an important tool for the data integration project manager is that it allows you to stay on top of things, and it gives you a trail you can follow to ensure the project is proceeding according to the schedule and what has been agreed upon. If you have a record of each decision and its underlying assumptions, you have at least the possibility to retrace the thought process and thus figure out why another avenue was not explored. A tiny decision, such as which separator to use in a concatenation of certain fields, can have an impact that is out of all proportion to its magnitude. Finding out when, why, and by whom that separator was decided upon may be exceedingly difficult without a written record of all decisions, great and small.
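To make the structure tangible, here is a minimal sketch of such a log as a Python data structure. The field names and the sample entry (including the separator decision) are illustrative; a shared spreadsheet with the same columns serves exactly the same purpose.

```python
# A decision log as a structured record; a spreadsheet with these
# columns works just as well. All names and values are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Decision:
    decision_id: str                # e.g. "DEC-042"
    date: str                       # when it was decided (ISO date)
    decided_by: str                 # who decided (person or committee)
    summary: str                    # what was decided
    supersedes: List[str] = field(default_factory=list)  # invalidated decisions
    effort: str = ""                # effort and cost required, if any
    reference: str = ""             # link to email, minutes, ticket, ...

log = [
    Decision("DEC-001", "2014-03-11", "lead architect",
             "Use '|' as the separator when concatenating address fields",
             reference="minutes of 2014-03-11, item 4"),
]
```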

Finally, it’s standard advice offered by all: the project manager should not work on the project itself, as the management duties are always the first to suffer.

Summary

The main project management challenges of data integration are:

  • Data integration without a clear-cut business case is doomed to fail.
  • Active support from all stakeholders, including an executive sponsor, is essential.
  • A realistic plan can be hard to make but is absolutely necessary to sell the project and to secure buy-in. Data profiling can help you size up the effort for data cleansing.
  • Scope creep is by far the largest threat to the project’s success: do not allow non-mandatory changes to the scope. Instead, save them for future releases of the data warehouse, so they do not affect the schedule of the core activities.
  • Timely access to all resources (people, systems, data, …) is of the essence, but project dependencies may very well stand in the way. An executive sponsor may need to come to the rescue.
  • The organization needs to embrace a culture that encourages sharing information across organizational boundaries, even when the data is messy.
  • Make sure all your critical (integration) tasks have a bus factor larger than 1.
  • When you deal with contractors, get your ducks in a row before they become lame ducks.