Many articles, whether they strike a cautionary or cheerleading tone, assume that GenAI as data analyst is a foregone conclusion. Maybe we're just a curmudgeonly lot, but these assertions and assumptions seem a bit, well, handwavy. The general implication seems to be that you're going to use AI for this work at some point so you'd better get your data into a condition where the AI-augmented output can be trusted. Otherwise at best you'll fall behind your competitors, and at worst you'll get biased or hallucinatory results, which will probably cause you to fall further behind your competitors.
It seems to us that if this level of data quality, whatever it is, were easy to achieve, organizations would already have achieved it! And it hasn’t been our experience that our clients lack interest in maintaining clean data, or that they lack the will to manage data quality. Rather, resources are finite, and deploying considerable resources to address data quality means other initiatives get back-burnered.
We have observed the following situation among no small number of our clients: analysts pull data out of enterprise BI platforms and other source systems, then blend and reshape it by hand before any real analysis can begin. Now, these enterprise BI tools are workhorses, and there are great data blending tools available. But in situations like this we have seen analysts do a significant amount of work just to get a data set loaded into their analytical tool, whether that's some kind of stats package, or a visualization tool, or even direct query via Python or SQL or something similar.
Most of us have probably heard the statistic that data scientists spend up to 80% of their time preparing data sets for analysis, and a much smaller proportion of their time actually performing and refining those analyses. Regardless of the exact ratio, however, this scenario is concerning, and not just because of the amount of time spent munging (that is, converting data from one format or structure into another) and unscrambling data.
In the process of trying to clean and assemble data for analysis, additional opportunities for data quality issues are introduced. What if an analyst decides to impute values where they are missing based on the distribution of values elsewhere? Sometimes this is a reasonable choice, but not every time. What if, in the process of blending data, duplicate records are created, due to subtle mismatches in coding or formatting? The further this data moves from its point of origin, the harder it becomes for others to audit its travel, so to speak.
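To make that blending risk concrete, here's a minimal sketch in pandas (with invented customer data and column names) of how a subtle formatting mismatch in a join key can quietly split one customer into two partial records:

```python
import pandas as pd

# Hypothetical scenario: two systems hold the same customers,
# but one zero-pads its IDs and the other does not.
crm = pd.DataFrame({
    "customer_id": ["0042", "0043"],
    "name": ["Acme Corp", "Globex"],
})
billing = pd.DataFrame({
    "customer_id": ["42", "0043"],
    "balance": [1200.00, 310.50],
})

# A naive blend treats "0042" and "42" as different customers, so
# Acme Corp splits into two partial rows: a subtle duplicate that no
# one downstream is likely to catch.
naive = crm.merge(billing, on="customer_id", how="outer")
print(naive)

# Normalizing the key before blending avoids the phantom mismatch.
for df in (crm, billing):
    df["customer_id"] = df["customer_id"].str.lstrip("0")
clean = crm.merge(billing, on="customer_id", how="outer")
print(clean)
```

Multiply that by dozens of feeds and hundreds of columns, and it's easy to see how each hop away from the source adds another place for errors like this to hide.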
Every analyst we work with is scrupulous, and in most cases they double- or triple-check their work. Still, everyone can make a mistake, and everyone brings unspoken assumptions to their work. To the extent that these organizations rely heavily on the knowledge and ability of individual analysts, they already run some of the risks we’ve seen articulated regarding AI: it's difficult, if not impossible, to explain the steps taken; no one else can replicate these results using the data from the agreed-on source systems; all too often the analyst was prompted to look into some data and report back if they found anything interesting. What happens if they report back that they did all this work and didn't find anything interesting?
Given what we know about the actual practice of data science, the need for a curated, trustworthy data set prior to undertaking analytics work ought to be obvious! But let's not beg any questions here: what does it mean to have a curated and trustworthy data set? It doesn't necessarily mean you have to have a data warehouse or a lakehouse, although those often help along the way. What it does mean, we believe, is that you have agreement at your organization about what data means, how it is to be used, and how metrics are defined, calculated, and understood. That's not to say it's impossible to have certified, reliable data products without a curated data store. Nor is it to say that just because this curation has taken place, any data products created from that source are automatically trustworthy.
Our experience suggests that humans find accuracy (how well the data corresponds to what can be empirically observed) and consistency (especially as data traverses multiple systems or periods of data collection) to be the most visible, and hence most important, dimensions of data quality. Your users probably know, for example, when a customer address is out of date, or whether the contact information for A/R is still good, or whether you really agreed to sell that product to that purchaser at that price! And if dates or currency formats or commonly used codes don't match up, your users will notice, and will point it out.
One thing we know for sure about generative AI is that its relation to empirical reality can often be tenuous; it could just as well assume that you gave a sweetheart deal to that customer, rather than verifying that the feed between accounting systems didn't fail or get corrupted. Your business analyst could make the same assumption, of course, but as we described above they tend to include error-checking and anomaly-spotting in their process.
It may be that other, less well understood dimensions of data quality are the most critical to address before your data is ready for analytics, whether AI-augmented or tackled by humans alone. Completeness comes immediately to mind. If, for example, you hope to use analytics to develop a more complete profile of customers so you can target retention efforts, then the data set you analyze needs to include data about customers from all the systems where it's captured, not just the data that makes it into a reporting repository. Validity is another: data is expected to be of a certain type, or to fall within reasonable ranges, and when it isn't, it will likely introduce chatter or noise into even basic descriptive statistics.
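To illustrate the validity point (a sketch only; the column names and thresholds below are invented), even a handful of declarative rules can flag the out-of-range values that would otherwise quietly distort your descriptive statistics:

```python
import pandas as pd

# Hypothetical order data with a couple of suspect values mixed in.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "quantity": [5, -2, 40],           # a negative quantity is suspect
    "unit_price": [19.99, 4.50, 0.0],  # a zero price is suspect
})

# Each rule says what "valid" means for one column.
rules = {
    "quantity": lambda s: s.between(1, 10_000),
    "unit_price": lambda s: s > 0,
}

# Surface violations for review instead of silently analyzing them.
for column, is_valid in rules.items():
    violations = orders[~is_valid(orders[column])]
    if not violations.empty:
        print(f"Validity check failed for '{column}':")
        print(violations, end="\n\n")
```

A mean or median computed over those rows would be quietly wrong; a check like this at least tells you the noise is there.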
As we noted above, for many of our clients, analytics outputs require data to make lots of hops. How are you going to orchestrate all of those hops for your GenAI agents, technically? And more to the point, why wait until this late in the data lifecycle to do this work? What if you simply took steps to ensure integrity and consistent data standards across systems and business offices?
We've speculated in earlier blog posts about using GenAI to help identify potential errors in your data, and to execute data quality assessments based on rules fed to it by your data stewards and engineers. If data standards were in place at the point of capture, and for each and every integration, and included as part of all your data set curation and data product certification, you'd have higher quality data to begin with. And with the processing and explanatory power of AI, it might become routine to nip persistent or deleterious data quality issues in the bud.
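As a thought experiment (the rule format, field names, and codes below are entirely our invention, not a feature of any particular product), standards at the point of capture might look something like this: each incoming record is checked against steward-defined rules before it propagates downstream, with GenAI helping to draft, explain, and tune the rules rather than guessing at the data after the fact:

```python
import re

# Speculative, steward-authored standards, expressed as data so they
# can be reviewed, versioned, and applied at every point of capture.
DATA_STANDARDS = {
    "email":       {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "postal_code": {"pattern": r"^\d{5}(-\d{4})?$"},
    "term_code":   {"allowed": {"FA24", "SP25", "SU25"}},
}

def check_record(record: dict) -> list[str]:
    """Return the standards violations for one incoming record."""
    problems = []
    for field, rule in DATA_STANDARDS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif "pattern" in rule and not re.match(rule["pattern"], str(value)):
            problems.append(f"{field}: fails format standard")
        elif "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: not an approved code")
    return problems

# Quarantine the record before it ever reaches a downstream system.
print(check_record({"email": "pat@example.edu",
                    "postal_code": "0210",   # four digits: caught here
                    "term_code": "FA24"}))
```

The point isn't the ten lines of Python; it's that the rules live in one governed place and run everywhere data enters, instead of being rediscovered by each analyst downstream.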
Capturing data quality standards and relating them to well-defined data elements is a key capability of the Data Cookbook, our data governance / data intelligence solution. We consider this a key component of any mature data governance framework, but we recognize it's something you might aspire to when you're just starting out. Data quality for operations might not be the same as data quality for analytics, given the type and amount of data your organization traffics in.
Any realistic vision for AI and analytics involves amplifying the scope and reach of existing analytics teams, doesn't it? The idea that an organization can go from having essentially no analytics capacity to a robust one by unleashing generative LLMs on its data is part of what we described in a previous blog post as a kind of mysticism about analytics (and data in general, really). We think GenAI probably can help grow analytics maturity: it can amplify the voice of analytics in the organization by providing more insights to more units, and it may well smooth and speed up the processes involved in data analysis.
Simply having access to AI (even if you're good at prompting it, and even better at recognizing when it gives you garbage) doesn't obviate the need to do the hard preparatory work of data governance. Understanding the level of data quality you need in order to perform the kind of analytics you want is part of this work, as is taking whatever steps are needed to get to that level.
We will come back to this topic in a future post. For too many organizations, analytics has failed to launch, or has failed to move the needle on decision making in any meaningful way. We have some thoughts about how AI might give organizations another bite at this proverbial apple, and about what you might do to get it right the next time.
Hope you found this blog post beneficial. To access other resources (blog posts, videos, and recorded webinars) about data governance and data intelligence, feel free to check out our data governance resources page.
Feel free to contact us and let us know how we can assist.