Managing Data Sprawl

Recently we discussed the concept of data sprawl. We find ourselves very much in the era of data sprawl, with so much data available for capture, so many specialized tools for its collection and storage, and so many users seeking to interact with it. Today we want to follow up on this concept, to discuss some of the risks of data sprawl, some of the reasons it has come into existence, and some ways to harness sprawl for your organization's benefit.

What are the risks associated with data sprawl?

Perhaps the most visible risk, as we mentioned in the previous post, titled "Understanding Data Sprawl", is the possibility of a data breach. More systems mean more logins, more passwords, more servers, and more databases; all told, that’s simply more points of entry (for cybercriminals) or more points of failure (in following good security protocols). Vendors’ security practices may be weaker than yours (although in fairness we should point out that many vendors may practice even more rigorous data security). The further that sensitive or valuable data resides from your information security office, your servers, or your review processes, the more likely it is that it will go unclassified as sensitive or valuable, and the more likely it is that if it is breached the effects will not be known for some time.
A similar risk has to do with the possibility of noncompliance with data protection regimes such as GDPR. Again, in the era of sprawl, how can an organization know exactly where protected data is stored? What guarantees can be made as to how quickly that data can be retrieved for examination when needed? Should data need to be destroyed or modified, how certain can any organization be that every instance of that data has been addressed? Failure to demonstrate compliance can result in legal or financial penalties, so there are very real consequences to running this risk.
Data sprawl makes it very difficult to know what data you have collected, where your data is, and who is responsible for it. Knowledge about remote data is often housed in one business unit, or even one or two people in that unit; when those people leave, or when that unit is reconfigured, the value of that data could be reduced, or even lost entirely. In some cases the data itself could essentially be lost, if the only users who know how to access or obtain it are no longer with the organization!
Another risk related to data sprawl is lost or diminished productivity. With data sprawl, a dedicated storage and/or security team, if you’re lucky enough to have one, probably spends time managing multiple data sources and maneuvering around silos that resist automation. And it’s just as likely that a valuable person in a business unit trades some of their business value for system administration when they acquire a unit-specific data tool. Another blow to productivity that hits closer to home for us occurs when a data product is derived from one of these systems into which there is little visibility. In that case we often have to spend time confirming the validity, reliability, and in some cases even the very existence of this data. Which means, of course, that less time is available to deal with the consequences or insights of that product.
Finally, we run the risk of incurring an opportunity cost with data sprawl. Additional data applications require additional resources for implementation, support, troubleshooting, integration, and so on. Those resources could be put to use elsewhere, right? And potentially they could add more value in those other uses. A little nearer and dearer to our heart is the possibility of a decision support opportunity cost. In most cases, we want to bring data from across the entire organization or company to bear when we make strategic decisions, but data sprawl can make it harder to collect, aggregate, and explore data in our analytics framework.

Why does data sprawl happen?

Employees don't believe they have the tools they need, or they think that the tools they do have are too difficult to work with, so they may search for new tools while they develop workarounds and quick fixes. New data applications generally mean new data, and more data, and potentially duplicated data. Employees are trying to address some business need or meet some business goal, and they are often not especially knowledgeable about software and data tools in general. So the potential for data sprawl is likely not to be on their radar. They can be susceptible to slick presentations, and they can develop tunnel vision about the scope of data problems and the solutions to those problems.
We have seen it asserted that the primary cause of data sprawl is unknown use of SaaS applications by employees, which seems like a bit of an oversimplification. Most of our clients are aware that these SaaS applications are being used; what they don’t know is what data is being collected, and why, what the data storage model and technologies are, and so on. Data sprawl is a risk any time you collect data without a plan for its use, or without consistent practices around gathering and classifying it, no matter how few tools are in use.
As with many data-related concerns, insufficient preventive standards or guidelines are in place. We have observed entire shadow economies, if not exactly black-market operations, where data is collected, stored, and reported on almost entirely “off the books.” Historically this was done using spreadsheets and desktop databases, which were probably worse than using cloud storage and SaaS applications from a security and auditability perspective! Still, the scope of data being collected and (mis)managed using desktop software undoubtedly pales in comparison to what newer enterprise applications collect. Many data users are unaware of best practices when it comes to collecting and storing data, but in our experience they are willing to adapt their behavior and practice when the organization articulates meaningful protocols.
Key technical considerations are not part of the purchasing process. Now, we understand that software and technology purchases are still driven by business exigency, but that doesn’t mean that a technical perspective prior to purchase and implementation shouldn’t be sought! Technical resources may know something about a potential vendor’s reputation or business history, they can identify additional support needs, and they can speak to added complexity around data storage, data retrieval, data security, and so on.
It can often be the case that data quality is not a sufficiently high organizational priority, so decisions that further compromise data quality continue to be made with impunity. A large aspect of data sprawl is simply the sheer amount of data out there, much of which may not be useful to you because it’s either duplicated or redundant or out-of-date in some way. The more data integrations you have, the more opportunities you’re likely to see for outright failure, or for introducing data integrity or consistency issues.

How can we manage data sprawl?

Commit to data intelligence.

Do you have an inventory of the data systems in use at your organization? Does that inventory include information about the kind of data collected or stored in each system? That means not just technical metadata or data models, but also whether the data is confidential or sensitive, whether it contains personally identifiable information (PII), whether it should be time-boxed in some way, and so on. An inventory like this is critical to your data intelligence work in general, and it is a basic, simple, yet often quite effective way to identify when unnecessary software is being suggested, or to determine when a data application should be re-evaluated.
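
For readers who like to see an idea in concrete form, here is a minimal sketch of what such an inventory might look like, written in Python. Everything in it is an assumption made for illustration: the DataSystem record, its field names, the two helper functions, and the sample systems and data elements are all hypothetical and do not describe any particular product or organization.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class DataSystem:
    """One entry in a hypothetical data system inventory."""
    name: str
    owner: str                             # unit or steward responsible for the system
    elements: Set[str]                     # data elements collected or stored
    contains_pii: bool = False
    confidential: bool = False
    retention_days: Optional[int] = None   # None means no time box has been recorded


def duplicated_elements(inventory: List[DataSystem]) -> Dict[str, List[str]]:
    """Map each data element held in more than one system to those systems."""
    holders = defaultdict(list)
    for system in inventory:
        for element in system.elements:
            holders[element].append(system.name)
    return {e: sorted(names) for e, names in holders.items() if len(names) > 1}


def needs_review(inventory: List[DataSystem]) -> List[str]:
    """Names of systems that hold PII but have no retention period recorded."""
    return [s.name for s in inventory if s.contains_pii and s.retention_days is None]


# Illustrative sample data only.
inventory = [
    DataSystem("CRM", "Advancement", {"email", "name", "gift_amount"},
               contains_pii=True, retention_days=2555),
    DataSystem("EventTool", "Alumni Relations", {"email", "name", "rsvp"},
               contains_pii=True),
    DataSystem("Warehouse", "IT", {"gift_amount", "term_code"}),
]

print(duplicated_elements(inventory))
print(needs_review(inventory))
```

Even a structure this small can surface the two questions raised above: which data elements are already collected in more than one place, and which systems holding sensitive data have never been given a time box. A real inventory would likely live in a catalog or governance tool rather than in a script.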

Take data governance seriously.

We noted above that in the absence of policy or clear direction, data sprawl is a common occurrence. And to some extent, data sprawl is probably inevitable. But just as some cities have avoided the worst excesses of urban sprawl by embracing “smart growth,” a similar set of techniques is available to your organization: do your best to know where your data comes from, where it resides, who maintains it, who decides how and when it can be accessed, and so forth. Data policies that guide data collection, storage, access, and sharing are not difficult to create, although they take on much greater value when they’re publicized and enforced.

We’ve written in this space about data lifecycle management several times. Check out our blog post titled "From Acquisition to Disposition: Data Management Pitfalls throughout Your Data's Lifecycle". Most of the focus on lifecycle management has to do with understanding the flow of data between systems, or from storage systems to analytical products, and of course clarifying end-of-data-life practices. In addition to policies governing what data can be collected, and how, it is useful to have standards that answer the question of why a particular data element or data set is collected, and how it is intended to be used.

Build up and continually enhance a culture of data.

Data literacy, data knowledge, and a commitment to data-informed decision making go a long way toward managing data sprawl. Data literacy, however you define it, and whatever baseline level you aspire to, helps employees understand the value and utility of data, it helps them separate data wheat from data chaff, and it contributes to an environment where data insights are continually sought and shared.

Data knowledge helps employees find out what data is already being collected within the organization, and who is responsible for it, and what are the appropriate standards for accessing, using, and sharing that data. Data knowledge is also evident when employees are encouraged to use existing tools to the fullest extent of their capabilities; when we see data sprawl, we tend to see not only duplicate data elements but duplicate data functionality.

Decision makers who are public about using data to inform decisions, who are transparent in their process, and who actively seek disconfirming data tend to make better decisions. There is no shortage of surveys and studies showing the competitive advantage of data-enabled organizations. But the data cannot simply come from Unit A or System B. Even the most well-curated analytics data store can come up short if it lacks critical data that sprawl has prevented you from including or reviewing.

Our data intelligence and data governance solution, the Data Cookbook, has core features that can be foundational in identifying, dealing with, and ultimately controlling data sprawl. From cataloging data assets to empowering (and holding accountable) data stewards to making information about your organization's data holdings publicly available, the Data Cookbook is your tool to enable smart data growth.

Feel free to contact us and let us know how we can assist.

(Image Credit: StockSnap_OBR3AXHVMQ_airplanepilots_datasprawl_BP #1181)
