As long as the notion of big data has been around, we have known that organizations of all sizes are exposed to more data, more quickly, across more systems and platforms. (These are the three Vs: volume, velocity, variety.) When it comes to really big data, of course, most organizations are not acquiring or storing that data; rather, for certain data research, analysts dip their toes into publicly available data pools, or temporarily enrich permanent organizational data with other data sets.
But even the smaller organizations we work with are experiencing something like what we recently saw referred to as "data sprawl." When we read the phrase, we thought immediately of urban sprawl, which we understand as the unchecked and often unregulated growth in and around cities. Sprawl is associated with the growth of suburbs and exurbs, and their effects. As metropolitan areas grow in geographic size, it becomes more difficult and often more expensive to travel in and around them. Services and resources that are only available downtown may see less usage, as people live and work further from them. Sometimes those services create satellite offices, which may affect their quality, or their ability to be consistent. Entirely new modalities and paradigms around housing, commuting, schooling, raising children, etc., spring up.
Some people consider urban sprawl unsightly, as mixed-use communities where people live, work, and shop give way to massive shopping plazas, office parks, and less dense residential housing. Urban sprawl increases pollution and heat islands, tends to reduce green space and natural habitats, and contributes to housing and transportation challenges. From a public health perspective, it probably makes people less healthy; from a city planning perspective, it is almost certainly an inefficient use of resources.
But sprawl has real benefits for those who prefer to live in suburban, exurban, or even rural areas. Land and housing may be more affordable; traveling by car, even in congested traffic, might be preferable to public transportation; and consumer options might be more plentiful or in some ways more convenient. For many people, the benefits of growth and expansion outweigh the disadvantages of sprawl.
Taken literally, data sprawl probably refers to a situation where an organization has collected so much data, and stored it in so many places, that it no longer knows what it has, or where to find what it's looking for. Data virtualization, software- and platform-as-a-service applications, and cheap and reliable cloud storage are key technological contributing factors, but nothing prevents data sprawl from occurring even if you host all your software and store all your data on premises.
Taken somewhat more metaphorically, however, we can observe some of the characteristics of urban sprawl when we look at data sprawl.
- The more places you store data, the harder it is to say what data you have, and to access it when you need it. (Think of this as a traffic issue.)
- The more places you store data, the less certainty you're likely to have in your system(s) of record. It could well be that data is more up-to-date in a noncanonical system! (Maybe this is akin to fewer people visiting downtown, or visiting it for narrower reasons and for a shorter timeframe.)
- Even with the best of intentions, new data integrations give rise to new data integration issues: incompatibilities with preferred tools and techniques, the need for integrations to occur on a schedule the data ops team cannot really support, a poorly written vendor API, etc. (This may not be a loss of green space, whatever that would mean for data, but it certainly sounds like building and maintaining roads and highways, or extending bus lines and train tracks.)
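One way to make the first two points concrete is a simple inventory of which systems hold which data elements. The sketch below is only an illustration: the element names, the system names, and the `system_of_record` flag are hypothetical and do not describe any particular product. It counts how many places each element is stored and flags elements that do not have exactly one designated system of record.

```python
from collections import defaultdict

# Hypothetical inventory: each entry records one place a data element is stored,
# and whether that system is the designated system of record for it.
INVENTORY = [
    {"element": "student_email", "system": "ERP", "system_of_record": True},
    {"element": "student_email", "system": "CRM", "system_of_record": False},
    {"element": "donor_status", "system": "CRM", "system_of_record": False},
    {"element": "donor_status", "system": "Spreadsheet", "system_of_record": False},
    {"element": "course_grade", "system": "LMS", "system_of_record": True},
]


def summarize_sprawl(inventory):
    """Group systems by data element and flag unclear systems of record."""
    stored_in = defaultdict(list)
    records = defaultdict(list)
    for entry in inventory:
        stored_in[entry["element"]].append(entry["system"])
        if entry["system_of_record"]:
            records[entry["element"]].append(entry["system"])

    report = {}
    for element, systems in stored_in.items():
        report[element] = {
            "stored_in": sorted(systems),
            "copies": len(systems),
            # Ambiguous means zero, or more than one, designated system of record.
            "ambiguous": len(records[element]) != 1,
        }
    return report


report = summarize_sprawl(INVENTORY)
for element, info in sorted(report.items()):
    print(element, info)
```

In this made-up inventory, `donor_status` lives in two places and neither is marked as canonical, which is the kind of uncertainty the second bullet describes.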
We have to accept as a given that more and more data is going to be generated, that more and more data will be available to us to collect, consume, and otherwise interact with, and that more and more data will eventually be required in order for us to provide the same kind or level of service we now provide (or to respond to new market demand, or otherwise continue on as a business). And we have to accept that the tools and technologies now in place will not suffice, if they ever did. Our centralized ERP systems and mixed-use reporting data stores have been supplemented, and in some cases supplanted, by boutique applications and bespoke tools. The business units who acquire these tools may find themselves more productive, or at least happier in their use of these tools; but this greater independence can carry major consequences for the organization.
One consequence is that overall data quality can easily deteriorate: there are too many opportunities for inconsistency and validity issues between systems, and too many new places for data ROT (redundancy, obsolescence, triviality) to occur.
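As a rough illustration of what an inconsistency check between two systems might look like, the sketch below compares one field across two extracts keyed by record ID. Everything in it is assumed for the example: the `erp` and `crm` extracts, the record IDs, the `email` field, and the `updated` dates are all invented, and a real check would depend on how your systems identify and timestamp records.

```python
from datetime import date

# Hypothetical extracts from two systems, keyed by a shared record ID.
erp = {
    "1001": {"email": "ana@example.edu", "updated": date(2023, 1, 10)},
    "1002": {"email": "ben@example.edu", "updated": date(2023, 3, 2)},
    "1003": {"email": "cy@example.edu", "updated": date(2022, 11, 5)},
}
crm = {
    "1001": {"email": "ana@example.edu", "updated": date(2023, 1, 12)},
    "1002": {"email": "ben.new@example.edu", "updated": date(2023, 6, 20)},
    "1004": {"email": "dee@example.edu", "updated": date(2023, 2, 1)},
}


def compare_systems(canonical, other, field):
    """List records that are missing from one system or disagree on a field."""
    issues = []
    for key in sorted(canonical.keys() | other.keys()):
        if key not in other:
            issues.append((key, "missing from secondary system"))
        elif key not in canonical:
            issues.append((key, "missing from system of record"))
        elif canonical[key][field] != other[key][field]:
            # Ties on the update date are attributed to the system of record.
            if other[key]["updated"] > canonical[key]["updated"]:
                newer = "secondary"
            else:
                newer = "system-of-record"
            issues.append((key, f"{field} differs; {newer} copy is newer"))
    return issues


issues = compare_systems(erp, crm, "email")
for record_id, problem in issues:
    print(record_id, problem)
```

Note that record `1002` in this invented data is more current in the secondary system than in the system of record, which is the situation described earlier where data is more up-to-date in a noncanonical system.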
Another of the concerns with data sprawl is that you expose yourself as an organization to a greater risk of a data leak or a data breach. There are more points of entry, more user turnover, more users who are further from central data authority, and, frankly, more systems with lousy security features.
A third consequence is that more data is not necessarily better data. Questions about provenance and lineage will abound, of course. More raw data is more fodder for analytics, but analysts need to know of the existence of new data, they need to know the business need it is collected to address, and they need to know where additional data might enrich existing analytics products. How does any of that happen?
What should organizations do in this era of data sprawl? We could, of course, commit to better, or more sustained, or more meaningful data governance. Data governance won't shut off the flow of new data (and it shouldn't have that as a goal), and it's far from clear to us that even the most robust data governance framework would do much to slow down this influx of data. But careful application of data governance principles could help us decide where to build the pipeline, so to speak, and what to do with the data when it emerges from that pipeline (on our doorstep?).
What does data governance in the era of data sprawl look like? From our perspective, it seems even more critical to pull back from the too-myopic-for-too-long view of data governance as a set of security protocols and data protection measures. If we really want a data-enabled workforce, then we have to expect data citizens to make decisions for themselves about the data they want to collect, the tools they want to use, and the storage mechanisms and data sharing agreements they want to authorize. We believe that data governance sounds more appealing, and much more feasible, when it involves providing users the tools, knowledge, skills, and other resources they need so they can use data effectively, ethically, and securely. Additional resources (videos, blog posts, recorded webinars) on data governance and data intelligence can be found at www.datacookbook.com/dg or in our "Wealth of Resources on Data Governance and Data Intelligence Topics and Components" blog post, located at https://blog.idatainc.com/
In our "Managing Data Sprawl" blog post we discuss some specific situations where data sprawl occurs, and some of our recommended data governance practices and activities to help you manage data sprawl.
IData has a solution, the Data Cookbook, that can aid employees and the organization in their data governance, data intelligence, data stewardship, and data quality initiatives. IData also has experts who can assist with data governance, reporting, integration, and other technology services on an as-needed basis. Feel free to contact us and let us know how we can assist.
Photo Credit: StockSnap_ZNUHYB5G3U_houses_datasprawl_BP #B1180