Do You Know Where Your Data Has Been and What It’s Been Doing?

The profusion of data sources and systems in higher education introduces plenty of challenges (headaches?) for data managers and consumers. One of the most difficult challenges is also one of the most basic: tracking data, whether it be about individuals, transactions, classes, what have you, as it moves from system to system. This is data lineage.

In higher education it is now common to see data collected originally via a CRM as prospective students identify themselves. Some of that data makes its way into an SIS or ERP system as a student is admitted and enrolls, where it is likely to be supplemented by data from external sources (e.g., financial aid information via FAFSA) and other campus systems (LMS, housing/residential, student activity tracking, etc.). If the student is employed on campus there may be additional information added to their data record. Upon graduation some student data could move into an alumni/development database, where alumni activities, donations, and additional employment and address information might be maintained for decades.

Data lineage, if recorded consistently throughout this process, can tell us exactly how any piece of data was captured, when, and by whom, and what path it has taken to get where it is right now. Over the course of a person's relationship with an institution, dozens of offices, scores of data systems, and hundreds of users might be involved in producing, capturing, and maintaining data. It's easy to identify multiple possibilities for erroneous entry, for poor integration or data transfer, for aging or loss of data, for improper merging or disposition of records, etc. In the age of GDPR and high-profile data breaches, enumerating data lineage, to say nothing of managing it, can be a daunting proposition.

What do we need to know when we refer to data lineage? Obviously, we want to know the original source of data, including the system where it was first stored. But it’s not enough just to track the movement of data; we also need to track what happens to it as it moves. Data lineage really shows its value when we can identify changes, adaptations, updates, deprecations, etc., across all the systems where data may reside, whether permanently or temporarily. Sometimes this audit trail is necessary in order to demonstrate compliance, and at other times this record of changes simply improves transparency and communication. Data whose provenance is clear, and whose integrity can be demonstrated, is data that people are predisposed to trust.
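To make this concrete, here is a minimal Python sketch of what a lineage record for a single data element might look like. It is an illustration only: the class names, field names, systems, dates, and values are all hypothetical and are not drawn from any particular product or institution.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class LineageEvent:
    """One step in the life of a data element (hypothetical structure)."""
    system: str            # where the value resides after this step
    action: str            # e.g. "captured", "transferred", "updated"
    actor: str             # person or process responsible for the step
    timestamp: datetime    # when the step happened
    value: Optional[str]   # the value after this step


@dataclass
class LineageTrail:
    """The recorded history of one data element across systems."""
    element: str
    events: List[LineageEvent] = field(default_factory=list)

    def record(self, event: LineageEvent) -> None:
        self.events.append(event)

    def _ordered(self) -> List[LineageEvent]:
        return sorted(self.events, key=lambda e: e.timestamp)

    def origin(self) -> str:
        """The system where the data was first stored."""
        return self._ordered()[0].system

    def path(self) -> List[str]:
        """Systems the element has passed through, in order."""
        systems: List[str] = []
        for e in self._ordered():
            if not systems or systems[-1] != e.system:
                systems.append(e.system)
        return systems

    def changes(self) -> List[LineageEvent]:
        """Events after which the value differs from the previous step."""
        changed, previous = [], None
        for e in self._ordered():
            if e.value != previous:
                changed.append(e)
            previous = e.value
        return changed


# Hypothetical example: a mailing address followed from CRM to alumni database.
trail = LineageTrail("mailing_address")
trail.record(LineageEvent("CRM", "captured", "prospect web form",
                          datetime(2019, 10, 1), "12 Elm St"))
trail.record(LineageEvent("SIS", "transferred", "nightly admit load",
                          datetime(2020, 3, 15), "12 Elm St"))
trail.record(LineageEvent("SIS", "updated", "registrar staff",
                          datetime(2021, 8, 20), "40 Oak Ave"))
trail.record(LineageEvent("Alumni DB", "transferred", "graduation export",
                          datetime(2024, 5, 20), "40 Oak Ave"))

print(trail.origin())
print(trail.path())
print([e.action for e in trail.changes()])
```

Even a structure this simple answers the basic questions: where the data originated, which systems it has passed through, and which steps actually changed it.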

A traceable data lineage is essential in our efforts to maintain a high level of data quality, as data integrity and consistency are challenged by the movement of data from system to system, by changes in control from one office to another, and of course, by new information that supplements or replaces existing data. But data lineage also comes into play in decision support and business intelligence.

How many times, for example, have we heard senior staff ask data providers, "Where did these numbers come from?" This can occur when the numbers seem to fly in the face of what had been expected, or what was reported previously (last week or last month), or, far too often, when multiple data providers offer up conflicting or at least disharmonious figures.

In contemporary data environments, you may be sourcing data for sharing and analysis from multiple places: a data warehouse or data mart, often filtered through some kind of operational data store; a frozen (point-in-time) file or snapshot of data; up-to-the-minute transactional data; on-the-fly data sets (think of analysts who extract custom data sets in order to display information using visualization tools or analyze it using a statistical package). Understanding data lineage enables us to validate data delivered to consumers for them to act upon, and to identify where errors, inaccuracies, or misunderstandings could have been introduced prior to data delivery.

Having a more complete view of data lineage is valuable in itself. It seems reasonable to expect stricter data regulations in the future, and steps taken now to comply with those regulations are certainly prudent. But strategic benefits can be realized right now from improving the reliability of your data, clarifying the sourcing of reported data, and contextualizing the combination and/or separation of data elements. Data deliverables whose provenance is clear, whose methods can be quickly explained, whose integrity is assured, and whose elements can be spot-checked against transactional sources are deliverables that can be trusted, and in turn relied upon to support institutional decision-making.

Ultimately, your work to document data lineage should be part of your broader efforts to understand and manage the entire data lifecycle, from acquisition through utilization and transformation until disposition (or archiving/permanent retention). But the full picture of data lineage can be overwhelming, and it may seem unattainable. This is why we encourage our clients to start small, and to start with the critical stuff, such as master data (data, often about individuals, that is used by many offices and may need to persist in your systems for unspecified but lengthy periods).

Ask yourself some questions and document your answers:

  • Do you have a system of record that receives data from another specialized tool?
  • What fields go from one system to the other?
  • What transformations, if any, occur during that transfer? What does the process entail? (Is it a cron job or scheduled task, does it feature automated ftp or an API call, etc.?)
  • Who is responsible for what tasks?

In our opinion, storing object-to-object relationships is a good start, but it is not enough. It’s the process, the human intervention, the application of business rules and subject matter expertise, that provides meaning to data movement; without the process we can understand the how but not the why of our data’s lineage.
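
One way to document the answers to the questions above is as a simple specification for each data transfer. The Python sketch below is only an illustration under assumed names: the classes, fields, systems, and offices are hypothetical, and it does not represent how the Data Cookbook or any other product stores specifications. It records the fields, the transformations, the mechanism, and the responsible parties, and it flags transformations that have no documented business rule (the "why").

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldMapping:
    """How one field moves from source to target (hypothetical structure)."""
    source_field: str
    target_field: str
    transformation: str = "none"   # plain-language description of the change
    business_rule: str = ""        # the reason for the change, i.e. the "why"


@dataclass
class TransferSpec:
    """Documentation for one movement of data between two systems."""
    source_system: str
    target_system: str
    mechanism: str                     # e.g. cron job, scheduled task, API call
    schedule: str
    responsibilities: Dict[str, str]   # task -> responsible office or person
    mappings: List[FieldMapping]

    def undocumented_rules(self) -> List[str]:
        """Source fields that are transformed without a stated business rule."""
        return [m.source_field for m in self.mappings
                if m.transformation != "none" and not m.business_rule]


# Hypothetical example: admitted-student data moving from a CRM to an SIS.
spec = TransferSpec(
    source_system="CRM",
    target_system="SIS",
    mechanism="scheduled task calling an API",
    schedule="nightly",
    responsibilities={
        "monitor the load": "IT integrations team",
        "resolve duplicate records": "Admissions office",
    },
    mappings=[
        FieldMapping("email", "email_address"),
        FieldMapping("intended_major", "program_code",
                     transformation="lookup against program table",
                     business_rule="SIS stores program codes, not free text"),
        FieldMapping("birth_date", "dob",
                     transformation="reformat MM/DD/YYYY to ISO 8601"),
    ],
)

print(spec.undocumented_rules())
```

Here the birth date mapping records how the data changes but not why, which is exactly the kind of gap that a conversation with the responsible office can close.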

IData, expert in data management for higher education institutions, provides the Data Cookbook solution, the leading governance solution for higher ed, which facilitates your data lineage management efforts. Integrations, flat file extracts, ETL processes, and other data movement can be documented as specifications. Definitions in the Data Cookbook’s business glossary can have multiple technical definitions: one for your CRM, one for your ERP, one for your LMS, etc. While these don't necessarily demonstrate the movement of data, they do combine to form a picture of data at various points where it resides. Data quality rules and reference data roll-ups are in the Enterprise Edition of the Data Cookbook, and each of these features can be utilized in service of your data lineage efforts.

Feel free to Contact Us if you would like assistance or would like to discuss how the Data Cookbook can assist your institution.

 (image credit StockSnap_FCLNL5MS30_datalineage_BP #1073)

Aaron Walker
About the Author

Aaron joined IData in 2014 after over 20 years in higher education, including more than 15 years providing analytics and decision support services. Aaron’s role at IData includes establishing data governance, training data stewards, and improving business intelligence solutions.