In a previous post we shared some thoughts about shadow systems, and how they are symptomatic of poor data governance practices as much as, if not more than, users seeking to evade rules. Eradicating them is, we suggested, more or less impossible until the people who create and maintain them have better options: a better understanding of the dangers of shadow systems and the importance of governing data; better access to the data they need for their regular work; better training in and support for the tools and applications that are part of a proper governance infrastructure.
Recently, however, we have begun to see a related new trend, one we call “shadow analytics.” Tools like Excel have long offered us the ability to analyze data using pivot tables or common statistical functions, and to create simple visualizations. Newer applications like Tableau have raised the stakes and lowered the barriers—now users can join disparate data sets, they can build reusable custom variables, they can even create in-memory OLAP cubes on the fly!
Giving users this level of ability to interact with their organization’s data is powerful, and done correctly it should vastly improve the organization’s ability to leverage the data it collects and stores. However, the tools now at users’ disposal can outstrip the organization’s ability to manage data, and what is increasingly common is a proliferation of shadow data sets, usually point-in-time extracts, which are then further manipulated by tools whose purchase or use may not have been well vetted.
At best we see institutional data become even further removed from their original source, whatever that may have been. This situation is especially fraught as multiple facets of data governance, from data quality to data lineage to shared definitions, tend to get swept under the proverbial rug when users mix and match data sets across time and systems. Here, too, security concerns are an issue, although probably not in the same way as with shadow databases and systems; still, any workaround that takes data from secure environments runs the risk of exposing sensitive data about individuals or confidential information about an organization.
The impulses here are understandable, even laudable. We want answers and insight now, not later—the time to iterate new elements into our data warehouse or even simply the semantic layer of our reporting tools may be too long, and many users have no need for this enriched data. With no central catalog of available reports and analyses, to say nothing of rich descriptions of derived and calculated data elements, users must fall back on private notes and their own curation, however accurate or relevant those may be. Increasingly, valuable data comes from multiple sources, and the task of aggregating data sets for analysis is too great for existing BI offices or tools.
The worst option is this: users pull some data according to their own whims, based on what they think are the correct parameters and after making some guesses about inscrutable field names. Then they realize that they need more data, perhaps even from a different system, so they create a frozen data set from that source, then join them together in some fashion. Some analytics tools make this joining simple, but just because we have linked a person’s ID in one data set to an ID in another doesn’t mean we have brought together a unified set of data.
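A toy sketch of this pitfall, using invented sample data and plain Python, shows how a naive ID join silently misleads: rows with multiple matches multiply, while rows with no match simply vanish.

```python
# Two point-in-time extracts, pulled separately from different systems.
# The IDs look compatible, but one system has multiple rows per person
# and the other is missing some people entirely. (All data is invented.)
enrollment = [
    {"id": "1001", "term": "FA23"},
    {"id": "1002", "term": "FA23"},
    {"id": "1003", "term": "FA23"},
]
awards = [
    {"id": "1001", "award": "grant"},
    {"id": "1001", "award": "loan"},   # same id, two rows
    {"id": "1004", "award": "grant"},  # id absent from enrollment
]

# A naive inner join on id:
joined = [
    {**e, **a}
    for e in enrollment
    for a in awards
    if e["id"] == a["id"]
]

print(len(joined))  # 2 rows: 1001 appears twice
dropped = sorted({r["id"] for r in enrollment} - {row["id"] for row in joined})
print(dropped)      # ['1002', '1003'] are silently gone; 1004 never appears
```

The join runs without error, which is exactly the problem: nothing about the result warns the analyst that two-thirds of the enrollment records disappeared, or that the duplicated 1001 rows will double-count that person in any aggregate.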
Now imagine that we couple these results, which likely represent tens if not hundreds of hours of work, with the ethos behind traditional shadow systems. What if users try to maintain this sprawling data set as an ongoing analytics source, which means trying to update it on a record-by-record basis? What if they “just” use it as a historical touchstone, and compare the results of new calculations from new data with the ones stored here? What if they try (and perhaps fail) to reproduce their steps every year so that they can have a time series of “frozen” files for attempting longitudinal analyses?
Ultimately, we would expect the same problems that plague shadow data systems to manifest in shadow analytics, including:
- data quickly gets out of date
- data integrity is difficult to maintain
- lineage and provenance cannot be established so validation is essentially impossible
- data is sourced from and displayed across multiple systems and tools, so consistency is almost certainly sacrificed
- in some ways, we are now exposing even more institutional data to more vulnerabilities.
As we have mentioned, the goal of data governance is to help people do their work more effectively, which is the same goal people have when they resort to satellite systems! So what can a commitment to data governance do to help people who need to analyze and interpret data?
Let’s recognize that the tools most users want come from the shallow end of analytics, and that many of the questions users ask come up again and again. Why not provide reports or dashboards with quick answers to those common questions?
Most institutions don’t have an up-to-date data catalog that:
- identifies data deliverables (such as reports or analyses)
- tells users what those deliverables contain and what sorts of questions they answer
- makes clear how and where to find/execute them
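As a minimal sketch of what such a catalog entry might look like, assuming entirely invented names and fields, each deliverable could carry those three pieces of information in a machine-searchable form:

```python
# A hypothetical catalog of data deliverables; every name here is illustrative.
catalog = [
    {
        "name": "Fall Enrollment Dashboard",
        "contents": ["headcount by program", "FTE by college"],
        "answers": ["How many students enrolled this fall?"],
        "location": "BI portal > Enrollment > Fall Dashboard",
    },
    {
        "name": "Degree Completions Report",
        "contents": ["degrees awarded by year and major"],
        "answers": ["How many degrees did we award last year?"],
        "location": "BI portal > Outcomes > Completions",
    },
]

def search(catalog, keyword):
    """Return the names of deliverables whose known questions mention the keyword."""
    kw = keyword.lower()
    return [entry["name"] for entry in catalog
            if any(kw in q.lower() for q in entry["answers"])]

print(search(catalog, "enrolled"))  # ['Fall Enrollment Dashboard']
```

Even a structure this simple lets a user ask “has someone already answered this?” before building yet another extract.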
We know from many industry surveys that data analysts spend an inordinate amount of their time cleaning up and reorganizing data sets. Shadow analysts typically lack that data-cleaning skill set, and they are less able to recognize spotty or unreliable data, so it is ever more critical to provide these users not only with curated data products but also with curated data sets.
We could devote another whole blog post (and maybe we will!) to what’s meant by a curated data set, but let’s at least establish some key features:
- data lineage
- consistency in names and descriptions
- simplicity and repeatability in extraction
- “virtualization,” by which we mean any of a variety of methods to make sure data sets don’t sit and age on a user’s workstation or in their cloud storage
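One way to make these features concrete, sketched here with invented field names and sample values, is to attach machine-readable metadata to every curated data set so that lineage, definitions, and refresh status travel with the data instead of living in someone’s private notes:

```python
# A hypothetical metadata record for a curated data set; all names are illustrative.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CuratedDataSet:
    name: str
    source_systems: list       # lineage: where the data came from
    extraction_query: str      # repeatability: how to regenerate it
    refreshed: date            # when it was last rebuilt from source
    definitions: dict = field(default_factory=dict)  # shared names/descriptions

    def is_stale(self, today, max_age_days=30):
        """Flag data sets that have sat unrefreshed too long ('virtualization' check)."""
        return (today - self.refreshed).days > max_age_days

ds = CuratedDataSet(
    name="retention_cohort",
    source_systems=["SIS", "LMS"],
    extraction_query="-- the saved query that rebuilds this set",
    refreshed=date(2024, 1, 1),
    definitions={"retained": "enrolled in both fall terms"},
)
print(ds.is_stale(date(2024, 3, 1)))  # True: 60 days old, past the 30-day limit
```

The specific fields matter less than the principle: if the query, sources, and refresh date are recorded with the data set, anyone can trace it, regenerate it, and know when to stop trusting it.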
A curated data set doesn’t make all data from all corners available; rather, it relies on knowledge of the business, its domains, and its needs to identify the areas and cases with the highest impact. This knowledge doesn’t inhere in your business analysts and report writers; it comes from your subject matter experts and data stewards. If their wisdom isn’t collected and shared, then curation efforts should be expected to fall short.
Something we’ve said at IData for years, but which has never been truer, is this: “Train in the data, not the tool.” It’s not by any means a bad idea to invest in key data tools and expect your employees to use them, but we also know from experience that they will use whatever is at hand to help meet their needs. Besides, it’s not the tool so much as the skill and the talent to use it that matter. Your employees need to use data in the course of their work, which will require that they build data fluency and data literacy, get exposure to and understanding of quantitative and statistical analysis, and grow comfortable with visual and numerical information displays. This growth will occur sooner if employees don’t also have to create their own data sets and find their own tools for analysis.
A searchable and current data catalog means budding “citizen data scientists” can use objects and tools already in place, and it means that whatever changes or enhancements they request can be accommodated in the existing BI environment. Curated data sets mean that data presented in an analysis can be vetted: lineage can be traced, usage and terminology can be regularized, and cleanup is handled not by individuals but via named and tamed processes. Finally, employees with the tools, resources, and time to explore their organization’s data will build the familiarity needed to identify useful data, interpret it meaningfully, and apply it to decisions and operations.
The problems caused by shadow data systems and shadow analytics are by no means simple to solve. But the only way to address those problems is to provide users with the access to data that they need. If you lock data away, if you provide only partial curation, or if your organization cannot recognize and respond to its employees’ data needs, then don’t be surprised to find them doing their work in the shadows!
IData has a solution, the Data Cookbook, that can aid employees and the institution in their data governance and data quality initiatives. IData also has experts who can assist with data governance, reporting, integration, and other technology services on an as-needed basis. Feel free to contact us and let us know how we can assist.
Let us know about your thoughts on shadow analytics and systems.