As data practitioners, we assume that even the most skilled user can and will make mistakes, and we have a number of safeguards in place to prevent or remediate them. When users are entering an address, we do not expect them to remember how many times "S" appears in Mississippi; instead, we give them a drop-down list of state names and abbreviations. Fat-fingering or misclicking can still occur, of course, and all of a sudden we might be trying to locate someone in Biloxi, Michigan, although some systems have additional safeguards, such as validating zip codes against states.
Over the years, many organizations have developed additional automated or semi-automated methods to detect data errors. In the case of bad address entry, maybe there are regularly generated reports that compare valid zip code ranges to state codes, or maybe there is a process to validate addresses before printing labels. At some point a human might even lay eyes on the data and recognize a potential discrepancy. For example, that person could know that the company sells a lot of product to buyers in and around Biloxi, MS, but that Michigan is not part of the sales area. And sometimes inaccurate or outdated data goes undetected until it is too late; in this particular example, something could be shipped to a nonexistent address.
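To make this concrete, here is a minimal sketch, in Python, of the kind of zip-to-state consistency check such a report might run. The three-digit prefix ranges below are illustrative, not a complete USPS mapping.

```python
# Minimal sketch of a zip-to-state consistency check; the 3-digit prefix
# ranges below are illustrative, not a complete USPS mapping.
ZIP_PREFIX_RANGES = {
    "MS": (386, 397),  # Mississippi
    "MI": (480, 499),  # Michigan
}

def zip_matches_state(zip_code: str, state: str) -> bool:
    """Return True if the zip code's 3-digit prefix falls in the state's range."""
    rng = ZIP_PREFIX_RANGES.get(state)
    if rng is None:
        return False  # unknown state code: flag for human review
    return rng[0] <= int(zip_code[:3]) <= rng[1]

# A record for "Biloxi, MI" carrying a Mississippi zip code gets flagged:
print(zip_matches_state("39530", "MI"))  # False
print(zip_matches_state("39530", "MS"))  # True
```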
Now, if what is being shipped is a postcard, or an invitation to look at our products in a showroom, the magnitude of the mistake is minimal. If, however, we are trying to send lifesaving medical supplies, there's a bit more urgency around getting them to the right place!
Again, we assume users will make mistakes. The training happened a long time ago, or they just forgot the rules and recommendations on this day, or they were busy or distracted by some other event, and so on. Because we know these things are likely to happen, we build some resiliency into our systems. Most desktop software allows you to undo, and important documents can be backed up automatically so that a pristine copy can be restored; most applications will time out after a certain period, and workstations controlled by your organization will log users out after inactivity; modern portable computing devices are encrypted and have additional security features enabled by default.
We apply other safeguards to our systems as well. Not all users can view all of an organization's data, for example, and even the data that a user can view or access may only be visible in certain situations, using prescribed tools. We apply some error-checking heuristics when we move data into a reporting or analytics data store (e.g., our warehouse, our data lake). If we are not running some kind of automated data profiling tool, we at least have exception reports that can flag potential issues.
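As a rough illustration, the sketch below applies a couple of such heuristics to incoming address records before they land in the warehouse. The field names and rules are assumptions, not a standard.

```python
# Sketch of a load-time exception report; the field names and the prefix
# ranges are illustrative assumptions, not a standard.
ZIP_PREFIX_RANGES = {"MS": (386, 397), "MI": (480, 499)}

def find_exceptions(records):
    """Flag records that fail basic heuristics instead of loading them silently."""
    exceptions = []
    for i, rec in enumerate(records):
        problems = [f"missing {f}" for f in ("city", "state", "zip") if not rec.get(f)]
        rng = ZIP_PREFIX_RANGES.get(rec.get("state"))
        if rng and rec.get("zip") and not (rng[0] <= int(rec["zip"][:3]) <= rng[1]):
            problems.append("zip/state mismatch")
        if problems:
            exceptions.append((i, problems))
    return exceptions

rows = [
    {"city": "Biloxi", "state": "MS", "zip": "39530"},
    {"city": "Biloxi", "state": "MI", "zip": "39530"},  # the fat-fingered record
]
print(find_exceptions(rows))  # [(1, ['zip/state mismatch'])]
```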
While many organizations have considerable resiliency around data collection and storage, they often have far less resiliency in place when it comes to integrating, publishing, and analyzing data.
Let's go back to our plausibly real-world example above. If we want to know how many potential purchasers there are in Biloxi, and we forget to specify the state, then whether we ask an analyst that question or try to figure it out ourselves using our business intelligence skills, we could get misleading results. This particular example is not likely to be too deleterious for our organization, but we can easily imagine situations where one big mistake, or numerous small ones, at any link in our data management chain, would result in unreliable analytics.
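A toy example makes the Biloxi ambiguity visible. The records below are hypothetical; the point is simply how quietly a city-only filter absorbs a mis-entered row.

```python
# Hypothetical purchaser records; the mis-entered "Biloxi, MI" row from
# earlier quietly inflates any count that forgets to specify the state.
purchasers = [
    {"name": "A", "city": "Biloxi", "state": "MS"},
    {"name": "B", "city": "Biloxi", "state": "MS"},
    {"name": "C", "city": "Biloxi", "state": "MI"},  # fat-fingered record
]

city_only = sum(1 for p in purchasers if p["city"] == "Biloxi")
city_state = sum(1 for p in purchasers if p["city"] == "Biloxi" and p["state"] == "MS")

print(city_only)   # 3 -- silently includes the mis-entered record
print(city_state)  # 2 -- the answer we actually wanted
```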
Even with the best technological tools in place, customers still come to us knowing that they do not really have a handle on the data they are collecting. They do not know, for example, which of their systems store personally identifiable information (PII), and which store confidential or otherwise restricted information. They do not know the full details of the process by which this data is collected, moved along, shared, and potentially disposed of.
Why don't they know? Again, there are many reasons. Organizations store data in dozens of systems, some of which are used across the enterprise and others of which might only be used by one small team somewhere. Offices collect data for different purposes. The people who originally made decisions about what data to collect and where to store it move on, and new people have different goals. Frankly, there is a lot more data available now, and there are a lot more ways to get our hands on it than there used to be, and some parts of organizations are hungrier for data than others.
When it comes to an organization's data, users often do not know exactly which pieces or sets of it they have access to, who manages the data, or how to go about getting access to data that might be useful to them. Moreover, far too many of them do not know how additional data might be useful, so it does not really occur to them to seek it out. The fact of the matter is that many users are already awash in data they have only the faintest idea what to do with; why would they seek out even more?
Data consumers need to be looking at data that is meaningful to them, and they need to access it through a useful analytical framework. What that looks like will vary by user; by data source, type, and product; and, of course, by the questions they are trying to answer or elucidate with data. What safeguards can we put in place to ensure not only the accuracy but also the usefulness and clarity of our data?
Obviously, no single person in an organization needs to have complete knowledge of all the data circulating through that organization, even if that were still possible. Indeed, in most cases, we suspect that most people do not need to be data experts, even regarding the data collected and managed by the domain in which they work! But they do need to be equipped to ask questions about data, to speculate about ways to use additional data, and to apply the results of data analysis to business operations and activities.
A knowledge base about organizational data seems like a no-brainer in this day and age. At a minimum, that knowledge base would identify all the data systems and applications in use, along with some level of detail about the kinds of data each one stores, accesses, or utilizes. A catalog that tells users where to find data, how to understand it when they encounter it, and whom to contact if there are further questions would also be a key element of the knowledge base.
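As a thought experiment, a single catalog entry might look something like the sketch below. Every field name here is an assumption about what such a knowledge base could record, not a prescription.

```python
from dataclasses import dataclass

# A sketch of one knowledge-base entry per system; the fields are
# assumptions about what such a catalog might record.
@dataclass
class SystemEntry:
    name: str
    description: str
    data_kinds: list[str]  # e.g., ["contacts", "addresses", "orders"]
    contains_pii: bool
    steward_contact: str   # whom to ask when questions come up

catalog = [
    SystemEntry(
        name="CRM",
        description="Customer relationship management system",
        data_kinds=["contacts", "addresses", "sales activity"],
        contains_pii=True,
        steward_contact="sales-data-steward@example.com",
    ),
]

# Answering "which systems store PII?" becomes a one-liner:
print([s.name for s in catalog if s.contains_pii])  # ['CRM']
```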
Along with this knowledge base we might consider some practical improvements to the way users access, share, and deploy data. For instance, we frequently work with customers as they develop curated data sets from which analytics and business intelligence can be drawn. This curation involves understanding data lineage, unifying and tracking business processes across the data lifecycle, and creating clear and accessible documentation around the logical, semantic, and presentation layers of the data set. And many of our customers have found it helpful to implement data certification processes, particularly around management dashboards, KPI presentations, and other frequently accessed reporting deliverables. This certification involves vetting and classifying data, agreeing on terminology and usage, and building workstreams that prioritize and streamline the creation of certified data products.
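To make the certification idea slightly more tangible, here is a minimal sketch of what a certification record for a dashboard might track. The fields and criteria are assumptions, not a prescribed standard.

```python
from dataclasses import dataclass
from datetime import date

# Sketch of a certification record for a reporting deliverable; the
# fields and criteria are assumptions, not a prescribed standard.
@dataclass
class Certification:
    product: str              # e.g., a management dashboard or KPI report
    definitions_agreed: bool  # terminology and usage signed off
    data_vetted: bool         # sources reviewed and classified
    certified_by: str
    certified_on: date

    @property
    def is_certified(self) -> bool:
        return self.definitions_agreed and self.data_vetted

dash = Certification(
    product="Quarterly Sales KPI Dashboard",  # hypothetical deliverable
    definitions_agreed=True,
    data_vetted=True,
    certified_by="BI team",
    certified_on=date(2024, 1, 15),
)
print(dash.is_certified)  # True -> can be badged as a certified data product
```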
Clearly, what we are talking about is data governance and data intelligence, in some of their many manifestations, and the people, processes, tools, and technologies involved in their success. But we have made it through this whole article without using that phrase, and without invoking the specter of the bureaucratic shifts and seismic change management that can accompany it. If that phrase is going to scare people away or turn them off, then do not use it: just do it.
What are your organization's data goals? What impedes your organization from meeting those goals? What are you already doing that is helping as you strive to meet those goals? Our tool, the Data Cookbook, is replete with features that can help you build on existing resilient practices, and that can help fill in those practice and knowledge gaps that are preventing you from meeting your data goals.
We hope you enjoyed this blog post. For additional data governance and data intelligence resources (recorded webinars, blog posts, and videos), please check out our "Wealth of Resources on Data Governance/Data Intelligence Topics and Components" blog post.
Photo Source: StockSnap