An important part of a data governance knowledge base is the data flow catalog, which contains information about items such as reports, dashboards, ETLs, data flow integrations, and surveys. You might call this a data processing catalog or a report catalog. In this blog post we cover what information should be tracked in a data flow catalog.
Here is what should be tracked:
- Process Name or ID – This might be a functional name, or it might be taken from the process ID, such as Informatica process ABC123. You might also want a user-facing name, such as Applicant Export.
- Data Flow Processing Type – such as Informatica export
- Tool – such as Informatica or Tableau
- Functional Areas or Domains
- Source System or Protocol System - The source data system, or the protocol system when the data is coming from a flat file or an API schema
- Target System or Protocol System - The target data system, or the protocol system when the data is going to a flat file or an API schema
- Selection Criteria (Context) – Provide some context about the data that is coming out, for example that it includes only records with a status of X, or only records that have changed since the last time the data flow item was run
- Purpose – For example, this exists to integrate two data systems
- Usability Rating - This one is optional. For example, a report might be rated certified, trusted, or untrusted, or it might carry an active do-not-use status.
- Policy Attributes - For example, making note of data flow items that contain PII or require a data sharing agreement
- Owner – If there is an explicit owner, record it. Ownership might be tied to a functional area.
- Access or Running Instructions – Add information about how to access the data flow item, how to run it, and how to get it. There should be a link to request the data flow item.
- Request Information – Information such as requestor, current owner, and status
- Versions and Status - Has the data flow item (such as a report) been developed yet? Is it complete? Has it been curated yet? For example, a report might have been changed three times, with a history of those changes, and the most recent version marked complete.
- Processing Frequency - How often does this run? Is it run on demand? Is it run daily? If the frequency is difficult to track, point to where this information can be found. Note how the frequency is set.
- Related Processes (sequences and dependencies as well as Collections) - Suppose an ETL process loads admissions data for a student, a prospect, or an applicant, with one data feed that loads all the demographic information and another data feed that loads the application. It is important to know that the application process is related to the demographic process and runs in sequence after it, so that if the demographic process fails, the application process fails too. The Data Cookbook solution also has the concept of a collection, which is essentially multiple specifications brought together for something like a factbook.
- Actual Processing Code (or link to code location) - Capture the actual processing code, such as a query, an extract, or other native code that you can import, or link to where the code exists. This matters when you inventory data flow items first and curate them later, because the inventory process might capture only the name, the type, the tool, the source, and the target. To fill in the rest, you are going to want to look at the code. If you cannot store the code itself, record where it can be found so that whoever curates the item can follow the link and look at it.
- Mappings and Lineage (labels, glossary, data model objects) – Not required. Even if you do not track the individual entities or objects, relationships, and transformations, you get immense value from understanding, at an inventory level, a curation level, or both, what these processes are, where they exist, how to find them, and what they do. A lot of people come to us and say they want to track lineage. When it really gets down to it, they do not need to do lineage yet, or even at all. They just need to understand what processes exist. At a minimum, the lineage you are trying to track is system-to-system lineage, which you capture here.
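To make the list above a little more concrete, here is a minimal Python sketch of how a handful of these attributes might be recorded for each data flow item. It is only an illustration and not how the Data Cookbook or any other tool stores this information. The `DataFlowItem` class, its field names, the two helper functions, and the sample values (other than the Applicant Export and Informatica examples mentioned above) are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataFlowItem:
    """One data flow catalog entry. Field names are hypothetical."""
    name: str                      # Process Name or ID
    processing_type: str           # Data Flow Processing Type
    tool: str                      # Tool
    source: str                    # Source System or Protocol System
    target: str                    # Target System or Protocol System
    purpose: str = ""              # Purpose
    usability_rating: str = ""     # Usability Rating (optional)
    policy_attributes: List[str] = field(default_factory=list)  # e.g. PII
    owner: str = ""                # Owner
    frequency: str = ""            # Processing Frequency
    depends_on: List[str] = field(default_factory=list)  # Related Processes
    code_location: str = ""        # Link to the actual processing code


def affected_by_failure(catalog: Dict[str, DataFlowItem], failed: str) -> List[str]:
    """Names of items that depend, directly or indirectly, on a failed item."""
    affected: List[str] = []
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for item in catalog.values():
            if current in item.depends_on and item.name not in affected:
                affected.append(item.name)
                frontier.append(item.name)
    return affected


def system_lineage(catalog: Dict[str, DataFlowItem]) -> List[Tuple[str, str]]:
    """System-to-system lineage: the distinct (source, target) pairs."""
    return sorted({(item.source, item.target) for item in catalog.values()})


# Sample entries; system names and values are invented for illustration.
items = [
    DataFlowItem(name="Demographic Load", processing_type="ETL",
                 tool="Informatica", source="Admissions flat file",
                 target="Student system", frequency="daily"),
    DataFlowItem(name="Application Load", processing_type="ETL",
                 tool="Informatica", source="Admissions flat file",
                 target="Student system", frequency="daily",
                 depends_on=["Demographic Load"]),
    DataFlowItem(name="Applicant Export", processing_type="Informatica export",
                 tool="Informatica", source="Student system",
                 target="Flat file", policy_attributes=["PII"]),
]
catalog = {item.name: item for item in items}

print(affected_by_failure(catalog, "Demographic Load"))
print(system_lineage(catalog))
```

Even with only the name, type, tool, source, and target filled in, a record like this supports the two questions raised above: which processes are affected when an upstream process fails, and which systems feed which other systems.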
Hopefully, this blog post was helpful in describing the different types of information that you should collect for a data flow catalog item. Our YouTube video on this subject is located here. Other data flow catalog resources (blog posts, videos, and recorded webinars) can be found here.
IData has a solution, the Data Cookbook, that can aid employees and the organization in their data governance, data intelligence, data stewardship, and data quality initiatives. IData also has experts who can assist with data governance, reporting, integration, and other technology services on an as-needed basis. Feel free to contact us and let us know how we can assist.