We hope everyone is staying home and staying safe as much as possible. Being data-minded, we've of course been following the COVID-19 charts, graphs, and projections closely. Huge amounts of data are being collected and analyzed, and, as with any data project, there are questions. In this blog post we want to share some thoughts.
A very commonly displayed chart shows the rate of infection by country over time. This chart uses a logarithmic scale, in part to better show what flattening the curve looks like, but not every version makes that aspect entirely clear.
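To make the logarithmic-scale point concrete, here is a minimal Python sketch. The case counts are invented for illustration and are not real data, and the function name `log10_steps` is our own. On a log scale, steady exponential growth produces equal steps (a straight line), while a flattening curve shows up as steps that shrink.

```python
import math

def log10_steps(counts):
    """Return the day-to-day change in log10(count).

    Equal steps mean a straight line on a log-scale chart (constant
    growth rate); shrinking steps mean the curve is flattening.
    """
    logs = [math.log10(c) for c in counts]
    return [round(b - a, 3) for a, b in zip(logs, logs[1:])]

# Hypothetical cumulative case counts, NOT real data.
doubling_daily = [100, 200, 400, 800, 1600]
flattening = [100, 200, 360, 540, 650]

steady_steps = log10_steps(doubling_daily)
slowing_steps = log10_steps(flattening)
print(steady_steps)   # every step is the same size
print(slowing_steps)  # each step is smaller than the one before
```

On a linear axis both series simply look like they are going up; the log-scale steps are what make the difference in growth rate visible.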
Other questions about this chart might include the reliability of the source data, and whether an apples-to-apples comparison is even possible. Do we know which organizations or agencies are responsible for data collection in each country, and what do we know about their methodology? There are strong suspicions that some countries are deliberately underreporting cases. Whether or not that suspicion is reasonable, we can be certain that the public health agencies in some countries are better-resourced than in others, and are gathering and reporting more data more quickly.
Even in the United States it's hard to get a clear picture. Some states are conducting more tests than others, both in raw totals and on a per-capita basis. And each state has different criteria for determining who is to be tested. This is understandable, based on the scarcity of tests and assumptions about transmission vectors, but it makes decisions at every level of government difficult. Some states show infection rates of 15% or more, and others show much lower rates. Is the "true" rate somewhere in the middle, or is the difference between states an artifact of testing protocols, or even simply a delay in reporting?
It's tempting to say that more data will clear up some of these questions, but there are good reasons to avoid that temptation. The data we have so far isn't inaccurate, but it's uneven, and it's likely to remain that way for some time. Having more of it doesn't make it cleaner. In terms of analysis, adding another variable to a model doesn't necessarily increase its explanatory power: the risks range from generating more noise to overfitting. And, of course, the stakes are very high right now: both inaction and the wrong action will lead to more illness and more deaths.
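One common way to see that an extra variable is not free is adjusted R-squared, which penalizes each added predictor. The sketch below uses the textbook formula; the R-squared values, sample size, and predictor counts are invented for illustration and do not come from any real model.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical fit statistics, NOT real results.
# Adding a fourth predictor nudges raw R-squared up from 0.800 to 0.805.
before = adjusted_r2(r2=0.800, n_obs=20, n_predictors=3)
after = adjusted_r2(r2=0.805, n_obs=20, n_predictors=4)

print(f"3 predictors: adjusted R^2 = {before:.4f}")
print(f"4 predictors: adjusted R^2 = {after:.4f}")
```

Raw R-squared never goes down when a variable is added, so a small bump like this one can hide a model that has actually become less informative once the penalty is applied.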
At IData we've been beating the drum for a long time in favor of improved data management practices, including a focus on rigorous data governance, widespread data stewardship, and transparent data policies and usage. We recognize that even the cleanest, most complete dataset won't necessarily lead to real insight, and that analytical methods vary. But understanding where and when data is sourced, how it is collected and stored, who is responsible for it, and which pieces of it are in stable enough condition to be mined is a pretty solid precondition for actionable analysis.
For a good example of making meaningful use of governed data in these very trying and uncertain times, have a look at Our World in Data. The data is presented simply, methods and assumptions are explained clearly (and those explanations are front-and-center along with any analysis), and the work is consistently improved by soliciting and incorporating both lay and expert feedback.
Governed data doesn't mean complete data, or perfectly accurate data, or even "good" data. It means we know where it came from, who collected it, and under what circumstances. It also means we know where there are gaps or inconsistencies, when assumptions and transformations have been applied, and what are and are not reasonable uses.
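As a rough illustration of what "knowing" these things can look like in practice, here is a minimal Python sketch of a provenance record. The class, its field names, and the sample values are all hypothetical; this is not how any particular tool stores governance metadata.

```python
from dataclasses import dataclass, field

@dataclass
class GovernedDataset:
    """Hypothetical provenance record; field names are illustrative only."""
    name: str
    source: str
    collected_by: str
    collection_notes: str
    known_gaps: list = field(default_factory=list)
    transformations: list = field(default_factory=list)
    reasonable_uses: list = field(default_factory=list)
    unreasonable_uses: list = field(default_factory=list)

    def undocumented_fields(self):
        """Return the names of provenance fields that are still empty."""
        return [key for key, value in vars(self).items() if not value]

# Placeholder values for illustration, NOT a real dataset.
record = GovernedDataset(
    name="daily_case_counts",
    source="state health department feed (placeholder)",
    collected_by="",
    collection_notes="reported once per day; weekend reports may lag",
    known_gaps=["no counts for untested residents"],
    transformations=["7-day rolling average"],
    reasonable_uses=["within-state trend over time"],
)

missing = record.undocumented_fields()
print(missing)  # ['collected_by', 'unreasonable_uses']
```

The record says nothing about whether the numbers are complete or accurate. It only makes visible what is known and what has yet to be documented, which is the distinction the paragraph above draws.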
Even under the best of conditions, it is prudent to avoid drawing far-ranging conclusions from the data we have collected, or expecting any of our analyses to be dispositive. In any circumstances, however, our use of data becomes vastly more effective when we have shared understanding, accessible documentation, repeatable processes, and an environment that supports improving and enhancing all of these.
Our best wishes are with those medical personnel who are treating this disease, and with those essential service providers who are working to keep those of us staying at home safe. (This is one area where there is sufficient data to draw reasonable conclusions: social distancing works. And it's an area where we can all do our part!)
But we also want to extend kudos to those who continue to maintain high data management standards in their collection, analysis, and publication of data, and who resist calls for unsupported conclusions or speculation.
Be safe, everyone. Show your work, double-check your calculations, and flatten that curve!
And as a reminder, IData has a solution, the Data Cookbook, that can aid an organization and its employees in their data governance, data stewardship, and data quality initiatives. IData also has experts who can assist with data governance, reporting, integration, and other technology services on an as-needed basis. Feel free to contact us and let us know how we can assist.