80% Preparation, 20% Misinterpretation: Citizen Data Science in the Context of No Governance

This post is inspired by two pieces of terminology that have seemingly been everywhere this year, and one factoid that’s been circulating for years. One term is “self-service business intelligence,” and the other one is “citizen data scientist.” And the factoid is that data scientists can spend up to 80% of their time preparing data for analysis, rather than performing that analysis. This post was adapted from a presentation delivered at the 2019 AAUDE Annual Conference in Eugene, OR.

Self-service BI (SSBI) refers largely to new tools (Tableau is the poster child but not the only one) that make it easier for users to interact with data, pretty much at every point in the chain. It’s easier to get the data set in the desired format, it’s easier to join data sets together, it’s easier to query and manipulate the data, and it’s easier to display the data. SSBI also rests on other technological developments: SaaS; virtualization; cloud storage and the generally negligible cost of data storage; high-speed internet and powerful microprocessors; and so on.

“Data scientist” is odd nomenclature (rare indeed these days is the scientist who does not work with data, and while the rigorous analysis of data has much to do with the scientific method we’re still working in the realm of mathematics), but for our purposes let’s accept that it’s a person who is trained to seek out and present insights gleaned from data. A citizen data scientist, then, must be a user or analyst who is not a database expert or statistician or even a BI professional in the traditional sense, but who is interested and somewhat skilled at – or at least not afraid of – working with data at scale.

At least two strains of practice have led here. One strain is that IT and traditional BI have explicitly tried to empower users, first with simple queries built on prebuilt data slices and more recently with some visualization features (often in the same tool). A second strain is the rise of Google and Tableau and other boutique products (e.g., Blackboard Analytics and Pyramid as analytics tools and Slate et al. as data-enabled end-user applications). Both are, in our view, positive developments. Traditional IT is best used in a complementary role, as rare indeed is the application developer or even analyst who has a deep understanding of business data. Although more data isn’t always better data, improved access to (wider swaths of) it more quickly certainly opens up more options. Analytics tools help bridge the gap between operational reporting and deep analysis, visualizations help more people understand data in action, and (limited) interactivity can lead to more exploration and deeper insight. Finally, best-of-breed tools help us target areas of weakness and provide strength and growth.

The promise of self-service BI tools and platforms is that both (trained and experienced) data analysts and casual business users (a/k/a citizen data scientists) are now empowered to unearth timely data insights on their own and to provide support for meaningful decisions without having to wait for assistance from IT. It's the perfect situation for more agile, insightful business intelligence and therefore greater business advantage, right?

Remember the statistic we referenced earlier about how much work has to be done preparing data, just in order to do a little work analyzing that data? There seems to be some disconnect between the promise of the tools and the actuality of the data and systems.

Prior to the advent of modern BI tools, the people we now call data scientists would receive or in some cases generate a huge data file, say a year’s worth of student enrollment, which they would then hope to analyze, say in statistical analysis software. Were there missing values in that data file? Were there duplicates? Or rows that looked like duplicates because a critical column hadn’t been included in the output? Were there inconsistencies in the data, particularly with respect to date fields and free-text fields? After resolving or excluding these issues, was the resulting file then ready for analysis? Of course not. We still had to impute values, convert qualitative values to quantitative and vice versa, turn numeric values into ranges and buckets, etc.
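The preparation steps described above can be sketched in a few lines of pandas. This is a minimal illustration only; the data frame, column names, and the 12-credit full-time threshold are hypothetical stand-ins, not anything from a real enrollment system.

```python
import numpy as np
import pandas as pd

# Hypothetical enrollment extract; values and columns are illustrative only.
df = pd.DataFrame({
    "student_id": [101, 101, 102, 103, 104],
    "term":       ["FA19", "FA19", "FA19", "FA19", "FA19"],
    "credits":    [15.0, 15.0, 12.0, np.nan, 6.0],
    "level":      ["UG", "UG", "UG", "ug", "GR"],
})

# 1. Drop exact duplicates (rows identical across every column).
df = df.drop_duplicates()

# 2. Normalize inconsistently entered categorical values.
df["level"] = df["level"].str.upper()

# 3. Impute missing credit hours with the median of the known values.
df["credits"] = df["credits"].fillna(df["credits"].median())

# 4. Turn a numeric value into ranges/buckets (part-time vs. full-time,
#    using an assumed 12-credit cutoff).
df["status"] = pd.cut(df["credits"], bins=[0, 11.9, 99],
                      labels=["part-time", "full-time"])

print(df)
```

Each of these four lines of cleanup embodies a judgment call (is the second row really a duplicate? is the median a defensible imputation?), which is exactly the kind of decision a trained analyst makes deliberately and a self-service tool may make silently.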

The challenges of SSBI and citizen data science are consistent with the higher education reporting and analysis situation we have experienced for many years now. The technical challenges around data preparation are far from new, so current manifestations strike us as changes of degree rather than kind.

Data scientists come to the table with, among other things, a facility for working with data sets large and small, a healthy skepticism for received wisdom and a possibly unhealthy conscientiousness about data accuracy and integrity, experience with data manipulation and analysis, and often significant training in research methodology.
Many of them still have to learn database query tools, useful visualization practices, data lineage, and perhaps most importantly what data their institutions traffic in and how people actually use it.

Nevertheless, because of their predilections and training, data scientists were (and are) willing to do the hard work of data preparation, without cutting corners, and for the most part without relying on tools to do this prep work for them. Now we have introduced tools that allow users to do their own data preparation, to query data sets without really knowing how to write queries, to create calculated variables without understanding covariance, and to do most of this work without much training or oversight. What could possibly go wrong?

These developments in SSBI are exciting, and promising, but they have not solved every data-related problem. Not only that, they have exacerbated some pre-existing conditions, and, in some cases, they have swapped old workarounds for new problems.

Pre-existing conditions exacerbated:

  • lack of trust in reported data due to fluctuations in outputs depending on who provides them and when;
  • weak underlying data quality fundamentals brought into sharper relief as data travels across more systems and as the ability of one person or team to clean data (see data prep later) fades away;
  • data lineage (where did this data come from?) and validation (how did it turn into this report?) further removed from enterprise applications.

“Swapped in” issues:

  • transactional siloes give way to analytical siloes;
  • problems curating data deliverables metastasize as now we run the risk of poorly curated data sets;
  • polysemy in metrics in some ways is even more problematic than polysemy in operating terminology. (It was one thing to realize that there were multiple ways to count full-time students and to calculate retention rates based on those counts; it’s another level of complexity altogether to build a predictive retention model and to use it as part of your data regime.)

If anything, data science in general, no matter who practices it, and self-service BI have made the need for curated and governed data even more acute.

IData’s model for data governance has always been based on documentation but centered on communication and collaboration. So not only, “What does this data mean”, “Where does this data come from” and “How did it get here,” but also, “Who’s responsible for this data”, “How can it be used”, “What is the context” and “Who else needs to be involved.” When we talk about data governance, we want to talk about specific things you can do and actions you can take. One benefit of governed data is that it follows rules and standards, rules and standards you can document as part of your knowledge base. You can also take actions to standardize or improve processes in diverse areas such as data production, data quality, data access and security, and so on. We also emphasize the concept of “just-in-time” data governance. Solve the problem in front of you but use tools and techniques that can be brought to bear on the data problems you’re sure to run into eventually. We have yet to see anything about the new self-service regime that tells us our data governance model is out of date.

Many of our clients already have a situation where three different methods produce three different answers, so why would we think that simply adding another method is going to help? Moreover, what reason is there to think that my citizen data science methods – remember, I’m neither a database professional nor a researcher – are the ones that will provide the best or most accurate answer?

Whether or not we like the direction these self-service tools have gone, and whether or not we agree that data science is a real thing somehow different from tried-and-true analytical practices, it does appear that something new is afoot. The genie is not going back into the proverbial bottle. And, to be clear, we don’t think that should be the goal. Locking down data and restricting access to information has not led to the kinds of creative insight that are increasingly needed in higher education, health care, and nonprofit management. But unfettered access and unmonitored usage are not the solution, either.

Our product, the Data Cookbook, is designed to help institutions manage and navigate both the explosion of data and data sources as well as the explosiveness of data analysis, by providing a framework for curating and governing data across the enterprise.

We deliberately use the concept of a business glossary rather than a data dictionary. A data dictionary provides technical information people can’t understand, in a format they can’t read, in a place they can’t find. That’s an important purpose but it’s already being met in spades by software and database vendors. Our target is a little different.

Every client we’ve ever worked with has hundreds if not thousands of aging and virtually identical canned reports that are owned by no one and used by no one, at least until someone decides to clean up. A usable data catalog at a minimum tells you why a report exists, what it does, who’s responsible for it, when it was created, and of course what it includes and how its data works.

And when we say how its data works, we mean documenting where information is sourced and how it is manipulated as it moves from system to system or from database to BI tool to output. All this information is easy to document and then search for within the Data Cookbook. If people find what they think are data quality problems, they can report them via the Data Cookbook and associate them with a data catalog item or a business glossary entry if appropriate. We can also document data quality rules and standards, both stand-alone and embedded in definitions of data terminology. The profusion of self-service tools and custom data sets often means that IT, for example, doesn’t know about all the tools and storage decisions. The Data Cookbook helps you keep a central tally of this kind of infrastructure.

Ten or fifteen years ago we might have thought that data fluency was the big challenge ahead of us: data providers didn’t know how to present it (how to tell stories with data), and consumers didn’t know how to understand it, to say nothing of making decisions based on it. Data fluency continues to be a challenge, although in different ways and for different reasons. But the failures we see in business intelligence and analytics almost always have their roots in data governance breakdowns.

We often say when working with new clients that data governance is what you do when you recognize that data is an asset and you start to treat it like one. It’s both a simple and simplistic formulation, but we like it. Assets need to be maintained, built up, leveraged when possible, stewarded, and used wisely. That’s hard enough to do when only a handful of people touch them; now that we’re throwing open the doors to the treasury, a simple system to identify who’s using which pieces of gold for what purposes doesn’t seem like all that much to ask.

IData has a solution, the Data Cookbook, that can aid the employees and the institution in its data governance, reporting, data-driven decision making and data quality initiatives. IData also has experts that can assist with data governance, reporting, integration and other technology services on an as needed basis. Feel free to contact us and let us know how we can assist.


Aaron Walker
About the Author

Aaron joined IData in 2014 after over 20 years in higher education, including more than 15 years providing analytics and decision support services. Aaron’s role at IData includes establishing data governance, training data stewards, and improving business intelligence solutions.
