We've been at this data management thing for a while now, and while much has changed over the last couple of decades, there are a few archetypes of data failure that continue to bedevil our clients which includes reference data.
One, of course, is the "this report doesn't match that report" phenomenon. Two different offices or two different analysts report different numbers; figures seem to change an unexpected amount from day to day or week to week; numbers just don't "look right."
Another archetype, often related to the first one, shows up in the litany of data quality complaints that get registered. Data is missing, data is out of date, data is erroneous or inaccurate, data collected one place isn't entered according to the same rules and formats as the same data collected elsewhere, etc. Sometimes data is collected, stored, and maintained properly, but reported or presented ineffectively, and so a default assumption in those cases is that the data is bad. Whether reality or perception, data quality challenges can be, well, challenging.
We've written and presented about the above scenarios many times. Today we want to look at another data management fundamental that trips up too many of our clients, and that is something called "reference data."
Reference data is data that is used to structure other data, which is a highfalutin' way of saying something fairly simple. If I want to enter a person's address, much of that information is handled using reference data: there are only so many states (or countries), only so many zip codes, only so many street types, etc.
Some reference data comes from external sources (in our above example, the US Postal Service), but often you create and manage them internally. For example, you may have a list of permissible prefixes (Mr., Mrs., Dr., etc.) or suffixes (Jr., III, Ph.D., etc.). And as you get deeper into your core business, you'll have reference data specific to your business. In education, something like student types; in medicine, a list of medical procedures; in commerce, perhaps a list of product categories.
No system really operates without reference data, which is critical in several ways.
First of all, reference data saves time and labor, as well as increasing accuracy. If I want to add the state someone lives in, it's much faster to choose it from a drop-down list or similar technique than to type in the full name, or to hope that I remember that MS is the postal code for Mississippi, not Missouri.
Second, reference data helps enforce consistency, whether through the use of lookup tables or simply as data capture rules. (This is true whether data is entered manually or via automation.)
Third, reference data contributes to improved efficiency, as data that comes into our organization's systems can just go there; no one has to do any research to figure out the acceptable range of or format for certain values.
Fourth, because reference data is shared across multiple domains, these benefits aren't limited to just one department or set of users.
In your database applications, reference data often shows up as a code or lookup table, which is maintained separately. These lists of valid values can be sourced externally, and/or they can be created and occasionally updated internally. Of course, the list of valid values that can be used in a particular field or data location isn't necessarily stored in a code table.
Either way, however, reference data can present some thorny data management challenges. People need to know what codes refer to, if the data entry and maintenance involves stored codes, and if you have something like a shared set of standards (say always using "BLVD" as shorthand for "Boulevard" in an address), people need to know that even more. Mistakes are still going to be made, so some kind of error checking needs to occur. If I'm in a hurry and I type "MA" instead of "MS," then the next time a mailing goes out maybe the recipient in Gulfport doesn't get what we send them. This error checking can be automated, or manual, or a combination.
User knowledge and behavior aren't the only challenges, and they're probably not the most critical, either. One of the most daunting aspects of reference data is that while many offices may use the values stored in code tables, only a few individuals or offices may have the ability to make changes, and these people don't always know how (or even that) the codes they manage are used elsewhere in an organization.
If I make a change to a lookup table and you don't know about it, what often happens is that at a very inopportune moment some key report or process basically breaks. And while there's something to be said at this point about not hard-coding dynamic values, that's a separate discussion. The key issue here is that people who work with data have to consider the ramifications of changes they make to data structures.
Another challenge is that different data systems may capture reference data differently. For example, system A allots two characters, but system B allows for four. Perhaps system A contains codes for medical procedures that we use in our hospital, but system B uses different codes around insurance billing and reimbursement. And who knows, maybe there's a system C that we use to report data about our work to some kind of government agency! So now we not only need to maintain three different sets of reference data to describe the same activities, we also need to maintain a set of translation tables!
A related feature of reference data is that it can be nested. A simple example might be county codes by state: first I choose the state code from one drop-down or smart list, and then I might be presented with a list of counties in that state. Now county names aren't changing that often, but this scenario is probably repeated at some level in every data domain at your organization. And reference data frequently needs to be "rolled up" for aggregating and simplified reporting and analysis. Sticking with our states as examples, sometimes we'll want to present data aggregated or sorted by region.
The Data Cookbook (Enterprise Edition) contains a rich feature set around managing reference data sets. It allows you to define reference data using everyday language, to associate a master list of values to any reference data set, to associate multiple lists (by data system), to map those lists back to the master list, and to document roll-ups. It also has an workflow to manage changes and to record approvals. And it can use the IDataHub to connect to your databases to check for and notify users about changes to the lookup tables in question!
But even if you don't have the Data Cookbook yet, you can do a better job managing reference data. We recommend organizing your reference data management practices so that the following questions are easily answered:
- What is the master set of values for a particular reference set, and where is it stored?
- Who is responsible for keeping any reference data current, and who needs to be notified when changes are considered?
- What other data is affected when reference data changes?
Key reports and potentially other items in your data catalog
Definitions of related data elements
Consistency of "downstream" data collections
- What is the process for verifying or reverting changes, and who is responsible?
As with many data governance tasks, this is not glamorous work. But as with so many other data management issues, small errors can have enormous impact.
Photo Credit: StockSnap_9LNS3XLF9Y_hidingplainsight_referencedata_book_BP_#1104