Dimensions of Data Quality
Ian Hellström | 5 May 2021 | 5 min read
Data is like underwear: preferably hidden, and always a lot dirtier than you think.
Everyone in the executive suite claims data is an asset rather than a by-product of doing business, but the reality is quite different. Talk is dirt cheap.
Issues with data quality cost businesses up to a quarter of their revenue. While administrators or infrastructure engineers are typically responsible for the systems they manage, they are rarely responsible for the contents of said systems, and rightly so. Unless people who have a stake in the quality of the contents are made responsible for the data itself, and they have the authority as well as technology to expose and fix issues, data will always be messy.
Such data stewards or data guardians ought to be core to any data organization, although there is a risk in making individuals responsible for cleaning up everyone else's mess: frustration, which is what I have called the data governance sinkhole.
Data quality is a multi-faceted concept and it must be assessed along several dimensions.
Is the data set available and accessible to users who are authorized to access it?
Access control mechanisms must be in place to ensure those who should have access to the data actually do. To ensure compliance with data privacy regulations, fields that certain (internal) users should not be able to see must be encrypted by default, with keys those users do not have access to. A process to obtain the relevant keys must be in place for those who have legitimate reasons to see such sensitive fields. It is possible to obfuscate such fields for everyone, but if that is done irreversibly, the data is gone.
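A minimal sketch of the idea, with field masking standing in for real encryption: the `can_decrypt` permission, the sensitive-field names, and the record shape are all illustrative assumptions, not from any particular system. In practice these fields would be encrypted at rest and decrypted only for holders of the key.

```python
# Hypothetical permission-based masking of sensitive fields.
SENSITIVE_FIELDS = {"email", "ssn"}

def view_record(record: dict, permissions: set) -> dict:
    """Return a copy of the record, masking sensitive fields unless
    the user holds the (made-up) key-access permission."""
    if "can_decrypt" in permissions:
        return dict(record)
    return {k: ("***" if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}
```

Because the masking is applied per request rather than irreversibly in storage, the underlying data survives for those with legitimate access.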
Is the data set available on time?
This can be ascertained from SLAs. Upon breaches, alerts are automatically sent to the teams responsible for each data set affected. Alerts for non-essential data sets are excessive. Too many frivolous alerts condition people to ignore all alerts, even real ones:
[U]sers tend to switch off data quality checks if they receive a large number of false-negative alerts or if the alerts are hard to understand. Google (2017)
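One way to keep alerts meaningful is to fire them only for breaches of essential data sets, as a sketch. The data set names and the 06:00 deadline below are made-up examples.

```python
from datetime import datetime

ESSENTIAL = {"orders", "payments"}   # hypothetical core data sets

def should_alert(dataset: str, arrival: datetime, deadline: datetime) -> bool:
    """Alert only on SLA breaches of essential data sets, so that
    people are not conditioned to ignore alerts."""
    return dataset in ESSENTIAL and arrival > deadline
```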
Is the data set complete?
Counters can be used to compare source and destination data sets, track how data sets grow over time, and detect temporal patterns, such as daily, weekly, or seasonal fluctuations. A sudden drop in records is immediately obvious and can be alerted on.
Aggregates of numerical fields are also prime candidates for automatic validation with counters, as they can be used to check completeness against other sources of information.
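Both checks can be sketched in a few lines; the tolerance and drop threshold below are illustrative assumptions that would be tuned per data set.

```python
def complete(source_count: int, dest_count: int, tolerance: float = 0.0) -> bool:
    """True when the destination holds at least (1 - tolerance) of the
    source's records."""
    return dest_count >= source_count * (1 - tolerance)

def sudden_drop(daily_counts: list, threshold: float = 0.5) -> bool:
    """True when the latest count falls below `threshold` times the
    trailing average, e.g. after an upstream outage."""
    *history, latest = daily_counts
    return latest < (sum(history) / len(history)) * threshold
```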
Are all fields within acceptable ranges, or do they take on pre-defined values?
Semantic checks can be added to ensure incorrect values are flagged as early as possible.
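A sketch of such checks, with made-up field names and an invented set of pre-defined country codes:

```python
VALID_COUNTRIES = {"SE", "DE", "NL"}   # hypothetical pre-defined values

def semantic_problems(record: dict) -> list:
    """Return human-readable problems; an empty list means the record
    passes its range and membership checks."""
    problems = []
    age = record.get("age")
    if age is None or not 0 <= age <= 130:
        problems.append("age missing or out of range")
    if record.get("country") not in VALID_COUNTRIES:
        problems.append("unknown country code")
    return problems
```

Returning descriptions rather than a bare boolean makes it easier to flag exactly what went wrong, and to do so as early in the pipeline as possible.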
Are the correct data types used?
For instance, is money encoded as a decimal rather than a float or double? For technical measurements, is the right number of significant digits stored? For text, are all characters properly encoded?
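The money example is easy to demonstrate: binary floats cannot represent most decimal fractions exactly, so cent-level errors creep into sums, whereas a decimal type does the arithmetic in base ten.

```python
from decimal import Decimal

# The classic float surprise: 0.1 + 0.2 is 0.30000000000000004.
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic is exact for base-ten fractions.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```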
When issues are detected, an emergency brake mechanism may be in order to ensure problems in core data sets do not propagate, since fixing propagated errors may require a large number of backfills. It's better to have data arrive late once, with all affected downstreams notified, than to have incorrect data that looks reasonable but needs to be corrected in every downstream afterwards. Whether to pull the brake obviously depends on the extent of the issue and the criticality of the data sets to the business: a single bad record may not warrant halting entire cascades of data sets.
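In its simplest form, an emergency brake is a publication step that raises instead of writing; the exception type and validator below are illustrative, not a reference to any particular orchestrator.

```python
class DataQualityError(Exception):
    """Raised to halt publication when validation fails."""

def publish(records: list, is_valid) -> list:
    """Refuse to publish when any record fails validation, so bad data
    arrives late (and correct) rather than on time (and wrong)."""
    bad = sum(1 for r in records if not is_valid(r))
    if bad:
        raise DataQualityError(f"{bad} invalid record(s); publication halted")
    return records
```

A workflow orchestrator would surface the failure to all downstream tasks, which is exactly the point: nothing derived gets built on the bad batch.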
Are all fields and records defined, understood, measured, and stored consistently?
Here it becomes murkier because it requires a common language across data sets, organizational boundaries, and schemas as they evolve. At the very least, core domain entities need to be defined so that everyone in the organization speaks the same language.
Whereas validity deals with examining individual parts, consistency is all about the whole. Invariances and known relationships among fields, records, or data sets must be encoded and enforced. Invalid data is relatively easy to spot (e.g. ‘yuzu’ as a phone number entry is obviously bananas), although it can be cumbersome to query or fix. Inconsistent data (e.g. a page enter event that happens after its exit), on the other hand, may be very hard to identify without checking known invariants and relationships in flight during an (idempotent) data pipeline.
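The page enter/exit example can be encoded as an in-flight invariant check; the `(kind, page, ts)` event shape is an assumption for illustration.

```python
def ordering_violations(events: list) -> list:
    """Flag exit events that have no matching enter, or whose
    timestamp precedes the matching enter. Events are
    (kind, page, ts) tuples."""
    open_enters, violations = {}, []
    for kind, page, ts in events:
        if kind == "enter":
            open_enters[page] = ts
        elif kind == "exit":
            if page not in open_enters or ts < open_enters.pop(page):
                violations.append((page, ts))
    return violations
```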
Application telemetry, especially when used to inform product decisions, cannot be left to individual developers’ whims in terms of what gets logged, where, and how.
In terms of features for machine learning, this also means that all values that are unknown, null, blank, incorrect, or what have you are encoded in exactly the same way. It is imperative that features for machine learning are stored consistently across versions of data, which can easily get out of whack due to backfills.
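A sketch of what that normalization could look like; the set of "missing" aliases and the choice of sentinel are assumptions that would be agreed per feature store.

```python
# Hypothetical aliases for "missing" seen across source systems.
MISSING_ALIASES = {None, "", "null", "NULL", "N/A", "n/a", "unknown"}

def normalize_missing(value, sentinel=None):
    """Collapse the many encodings of 'missing' into one sentinel so
    features are stored identically across data versions."""
    return sentinel if value in MISSING_ALIASES else value
```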
Is the data set accurate, reliable, or at least fit for purpose?
It’s practically impossible to prove that a data set is entirely correct, as that would require proof of correctness of all soft- and hardware components it passes through, including humans.
The key is that data must be accurate enough to be usable. This obviously requires a case-by-case assessment, which must be factored into each and every initiative.
Note that data can be valid (i.e. correct data types and ranges of values), consistent (i.e. a page enter and an exit match, and they are sequenced correctly), but still incorrect (e.g. a page enter/exit event on a button instead of, well, a page). Such incorrect yet technically valid and consistent data typically arises from bugs and leftover debug log entries that fly under the radar, as application logging is rarely spec'd and tested properly. In production, these seemingly minuscule issues pollute application telemetry data sets.
Is the data set trustworthy?
This includes documentation on provenance, details on all its columns or fields, intended use cases, and of course limitations.
Trustworthiness is also a combination of the above dimensions of data quality. If, for example, a data set is always late, that can cause confidence in it to decrease, especially when intended as a source for reliable derived data sets or data-driven products and features.
If a data set is regularly or automatically validated against others, details should be made available for anyone to inspect. Transparency is key to trustworthiness. And as always, trust, but verify.
Checks are commonly phrased as expectations. These expectations ought to be available as understandable descriptions in the generated documentation, including not only a check mark or ‘health’ score, but also tables and charts with the full results or at the very least links to the results. People interested in any data set must be able to assess its trustworthiness quickly and independently.
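As a sketch, in the spirit of tools such as Great Expectations, an expectation can be a pair of a human-readable description and a predicate, so the report reads the same way in the generated documentation; the expectations themselves are invented examples.

```python
EXPECTATIONS = [
    ("user_id is never null",
     lambda rows: all(r.get("user_id") is not None for r in rows)),
    ("age lies within 0..130",
     lambda rows: all(0 <= r.get("age", -1) <= 130 for r in rows)),
]

def expectation_report(rows: list) -> list:
    """One human-readable pass/fail line per expectation, suitable
    for inclusion in generated documentation."""
    return [f"{'PASS' if check(rows) else 'FAIL'}: {description}"
            for description, check in EXPECTATIONS]
```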
If the data set is a derived data set, it should be clearly marked as such; this information can easily be pulled from the workflow orchestrator's lineage tracker.
To increase trust, highlight any anomalous values, check the evolution of distributions of values automatically, and visualize counts by category as functions of time, so that it’s easy to see drift in data. When the data distributions shift over time, that may affect decisions or machine learning products, which is why it is crucial to have that information available and easily accessible at all times.
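A minimal drift check along these lines compares category proportions between a reference window and the current window; the 10-percentage-point threshold is an illustrative assumption.

```python
from collections import Counter

def drifted(reference: list, current: list, threshold: float = 0.10) -> bool:
    """True when any category's share moves by more than `threshold`
    (in absolute proportion) between the two windows."""
    ref, cur = Counter(reference), Counter(current)
    n_ref, n_cur = len(reference), len(current)
    return any(abs(ref[c] / n_ref - cur[c] / n_cur) > threshold
               for c in set(ref) | set(cur))
```

Run per day (or per demographic split) and plotted over time, such a check makes gradual shifts visible before they silently degrade decisions or models.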
The same goes for splits by common demographic factors, or better still, interactions: for instance, age and location, or race and gender. These can highlight issues with bias in the data early on. Detect bias as soon as possible, so that algorithms do not amplify biases already built into the data. After all, fairness is a key ingredient of customer trust.
And that is not all. You can also think about confidentiality, compliance, understandability, and recoverability in case of human error or a disaster.
Remember: data quality is everyone’s problem.