When we speak with new or prospective clients, the topic of their organization's data quality often comes up. And it's frequently our experience that when someone who is not data-savvy thinks their organization has poor data quality, it turns out that they have a limited frame of reference. What they go through looks like this: it takes a long time to generate data for their consumption; when that deliverable arrives, it is problematic (figures seem inconsistent or don't add up) or difficult to interpret; summarized data doesn't square with what they "know" to be true; and so on. It's not unreasonable at that point to assume that organizational data is low quality.
Those situations can reflect poor data quality: timeliness is certainly a key dimension of data quality, reports whose elements sum to more than 100% of a published total may reflect duplicate entries or coding in a source database, and some users do know their data inside and out. In our observation, though, they tend to be more indicative of poor process and weak communication, which speaks to a breakdown in, or a fundamental lack of, data governance, data intelligence, and/or data management.
How is bad data identified in the first place? Sure, it can happen because a keen-eyed person looks at an individual record and notices an inaccuracy (that person's name is misspelled, or they no longer live at the address recorded for them) or an invalidity (the zip code for Chicago is not 34543!) or inconsistency (these people are category A in this system but A4 in this other system), but frankly those instances are rare.
Somewhat more common may be the situation where a person knows something about organizational data, maybe because they have a vested interest. Perhaps a person recognizes that the list of all the sales closed last quarter is incomplete. Or I know that this patient was discharged two days ago, so why am I being prompted for a status update when I log into my clinical portal? Or we're trying to close the books on a fiscal period and some invoices that should have been paid weeks ago are still marked as open.
People most often grow skeptical about their organization's data quality when they look at data in the aggregate. Numbers change without explanation or warning, possibly in hard-to-believe ways: last week we had 780 applications complete, and this week we only have 770! Figures don't square with expectations: our sales forecast for this quarter is $9 million, but our pipeline only shows $4 million! Or perhaps the most familiar complaint of all: I asked for this information last week, but by the time I received it, it was too late to act on.
So this user perception of low-quality data tends to show up in formats that don't lend themselves to quick quality checks, and potential problems tend to be noticed by people who may not otherwise work directly with organizational data at the record level.
In fact, perceived data quality issues may represent many different problems, and they can develop at many places in the data management lifecycle.
- Sometimes the issue is inaccuracy or a lack of validity, and the source of the problem occurs during data capture: user error, or users entering data inconsistently, or insufficient data validation (during bulk imports, say).
- All too often, consistency and integrity issues arise when data moves from system to system. This can be like the game of Operator, where the more hands that touch the data, the more likely it is to be corrupted or degraded along the way. It can also be due to differing formatting and storage rules among complementary systems.
- Sometimes, as with the incomplete list of closed sales in the example above, data is being maintained in a shadow system along with or instead of our system of record, and we suffer from a lack of completeness. (Or our system of record is not even being consulted, and users are relying on out-of-date extracts or haphazardly maintained databases of their own design! See this article for more on this state of affairs.)
- And sometimes data gets out of shape during extraction, manipulation, and presentation. This can signal a lack of familiarity with data uniqueness, showcase an organizational inability to provide data in a sufficiently timely manner, and highlight data literacy shortcomings.
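To make a few of these dimensions concrete, here is a minimal sketch of record-level checks for validity, completeness, uniqueness, and consistency. Everything in it is an assumption for illustration: the records, the field names, the two hypothetical systems, and the simplified rule that a Chicago zip code starts with 606. Real checks would come from your own data definitions.

```python
import re
from collections import Counter

# Hypothetical records; names, fields, and values are invented for illustration.
records = [
    {"id": 1, "name": "Ada Park", "city": "Chicago", "zip": "60614",
     "category_sys1": "A", "category_sys2": "A"},
    {"id": 2, "name": "Ben Ruiz", "city": "Chicago", "zip": "34543",
     "category_sys1": "A", "category_sys2": "A"},
    # Same id entered twice, as might happen during a bulk import.
    {"id": 2, "name": "Ben Ruiz", "city": "Chicago", "zip": "34543",
     "category_sys1": "A", "category_sys2": "A"},
    {"id": 3, "name": "", "city": "Chicago", "zip": "60601",
     "category_sys1": "A", "category_sys2": "A4"},
]

# Assumption: a simplified validity rule, not an authoritative zip code list.
CHICAGO_ZIP = re.compile(r"606\d{2}")


def check_quality(rows):
    """Return the ids that fail each (simplified) data quality check."""
    validity, completeness, consistency = set(), set(), set()
    id_counts = Counter(row["id"] for row in rows)

    for row in rows:
        # Validity: the zip code must fit the rule for the recorded city.
        if row["city"] == "Chicago" and not CHICAGO_ZIP.fullmatch(row["zip"]):
            validity.add(row["id"])
        # Completeness: a required field must not be blank.
        if not row["name"].strip():
            completeness.add(row["id"])
        # Consistency: the two systems should agree on the category.
        if row["category_sys1"] != row["category_sys2"]:
            consistency.add(row["id"])

    return {
        "validity": sorted(validity),
        "completeness": sorted(completeness),
        # Uniqueness: an id should appear only once.
        "uniqueness": sorted(i for i, n in id_counts.items() if n > 1),
        "consistency": sorted(consistency),
    }


for dimension, ids in check_quality(records).items():
    print(dimension, ids)
```

Checks like these catch record-level problems, but they say nothing about timeliness or about whether a summary figure was produced the way its reader assumes.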
There is good news. If the situations described above can be recognized, steps can be taken to address them. An even bigger issue when it comes to data quality, however, may be that in many cases we don't really know whether our data is high-quality, low-quality, or somewhere in between. And why don't we know?
As a data consumer, I have to evaluate data quality based on limited access to data, often using a heuristic like "is this count of x accurate?" But I typically have minimal firsthand knowledge of x, I have almost no visibility into the way the count is produced, and my standards for accuracy might be based entirely on unfounded or untested assumptions about x, the way it's counted, or previous counts of x that have been provided.
Challenges when considering data quality can be grouped into the following three categories.
- Context. If I don't know enough about what some office or division does, or if I haven't seen enough data about certain operations, I don't have any way of knowing whether that figure I've been provided of, say, 120 alumni living in Sedona is reasonable. I can maybe make some basic inferences - if we're a community college on the east coast I might be skeptical of that figure, but then again maybe we have a longstanding arrangement with schools in Arizona and we occasionally swap students - but I can't know without further study. Without a contextual framework to recognize data issues, it's difficult to provide informed feedback.
- Definitions. If we define alumni as people who took at least one class from our institution, including online classes, and we have a national presence because of our programs in desert wildlife management, maybe that figure of 120 alumni in Sedona is far too small! Without a proper and comprehensive definition of whom we include when we count alumni, I'm entirely at a loss when presented with that figure. Without data definitions that include both technical lineage and business usage, judging data quality definitively is essentially impossible.
- Curation. The reason data takes time to deliver may be that it's being checked and rechecked prior to passing it along to management. Sometimes a data request, especially for something new, takes time to make its way up the demand queue; sometimes the technical environment is slow or suffers from latency; sometimes the requester hasn't been granted clearance to certain data sets, and there are then procedural steps to go through before any queries are written or executed. Even if I specify clearly who qualifies as alumni, and what it means to reside in Sedona, our data set may not be in a position to readily give up the information I want, and/or our analysts may not be familiar enough with the data to know quickly whether their results are accurate. Without a curated data set, and data curators to support organizational usage of it, competing figures, interpretations, and analyses of data will prevent or at least severely hamstring decision support activities.
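The point about definitions can be shown with a small sketch: the same records produce different alumni counts depending on which definition is applied. The people, field names, and both definitions below are hypothetical, invented only to illustrate the idea.

```python
# Hypothetical records; field names and values are invented for illustration.
people = [
    {"name": "P1", "city": "Sedona", "degree_earned": True, "classes_taken": 40},
    {"name": "P2", "city": "Sedona", "degree_earned": False, "classes_taken": 1},
    {"name": "P3", "city": "Sedona", "degree_earned": False, "classes_taken": 0},
    {"name": "P4", "city": "Boston", "degree_earned": True, "classes_taken": 38},
]


def is_alum_graduates_only(person):
    """Assumed definition 1: only people who earned a degree count as alumni."""
    return person["degree_earned"]


def is_alum_any_class(person):
    """Assumed definition 2: anyone who took at least one class counts."""
    return person["classes_taken"] >= 1


def count_alumni(rows, city, definition):
    """Count the people in a city who qualify as alumni under a definition."""
    return sum(1 for p in rows if p["city"] == city and definition(p))


print(count_alumni(people, "Sedona", is_alum_graduates_only))  # 1
print(count_alumni(people, "Sedona", is_alum_any_class))       # 2
```

Neither count is wrong. Each answers a different question, which is why a figure delivered without its definition is so hard to judge.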
Our response is: data quality management (DQM) is a data governance initiative. Data governance, when it's successful, is characterized by openness and transparency, collaboration and communication, and by iterative improvements to process and practice. So for data quality to get better, and for data to be trusted, DQM programs will need to leverage the knowledge of people who manage and maintain data: data stewards, subject matter experts, data and business analysts. And DQM efforts will benefit from adopting or mirroring the data governance framework they are part of.
We could write about data quality until the proverbial cows come home, and over the next few months we just might! But in the meantime, here's hoping this post has broadened your horizons as you think about data quality challenges and opportunities in your organization. And if you're interested in some practical steps you can take, we recommend you watch this presentation from our founder and CEO, Brian Parish.
See our Data Governance Resources page for additional resources. Feel free to check out our other data quality resources in this blog post.
IData has a solution, the Data Cookbook, that can aid employees and their institution in data governance and data quality initiatives. IData also has experts who can assist with data governance, reporting, integration, and other technology services on an as-needed basis. Feel free to contact us and let us know how we can assist.