Author’s note: charts have been updated after the first posting to reflect recent data.
Raw numbers are horrific. Per capita statistics give a more complete picture of the distribution of COVID cases, but they may not be the best measure either. For non-communicable conditions (cancer, cataracts, constipation), “X cases per Y people” does provide a useful metric for geographic comparison. But COVID is highly transmissible. The outbreak in New York City came faster and grew larger than in [fill-in-the-blank rural city in rural state]. Is per capita comparison really apples to apples, or is it necessary to include population density by measuring “X cases per Y people per Z area”?
To test this, I first looked at a list of 268 countries, territories, or geographic regions. For each date, I computed the correlation across regions between COVID cases and population density. (It would be bad practice to compute one total correlation over all dates because of serial correlations within each region. Plus, it’s more interesting to see the evolution.) Early on there are only a handful of areas with outbreaks, so I cut out dates before there were 20 observed deaths, and the series quickly converges to around 250 regions included for each date.
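As a rough sketch of that per-date calculation (not necessarily the code in the repository), something like the following would work. The record layout, the function names, the use of a plain Pearson correlation, and the reading of the 20-death cutoff as a total across all regions on a given date are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant variable has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_by_date(records, min_deaths=20):
    """Map each date to the cross-region correlation of cases and density.

    `records` is an iterable of (date, region, cases, deaths, density)
    tuples. This layout is hypothetical, chosen only for the example.
    """
    by_date = defaultdict(list)
    for date, _region, cases, deaths, density in records:
        by_date[date].append((cases, deaths, density))

    result = {}
    for date in sorted(by_date):
        rows = by_date[date]
        # Assumed cutoff: skip dates with fewer than `min_deaths` deaths in total.
        if sum(deaths for _, deaths, _ in rows) < min_deaths:
            continue
        # One correlation per date, so serial correlation within a region
        # never enters the calculation.
        r = pearson([c for c, _, _ in rows], [d for _, _, d in rows])
        if r is not None:
            result[date] = r
    return result
```

Computing a separate coefficient for each date treats every day as its own cross-section of regions, which is what lets the correlation be plotted as a time series.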
Would you look at this! This data suggests that population density and number of COVID cases are actually negatively correlated! When population density increases, the relative number of cases is expected to decrease. While the correlations are statistically significant (there is high confidence that they are close to -0.02), the practical significance is painfully trivial. Case closed…
…but there’s more to the story. The average population density over an entire country isn’t particularly useful, because the average blurs the details: maybe not for Monaco or Mongolia, but certainly for the United States, China, and the majority of the other 264 countries and regions analyzed. I offer another look at the correlation between population density and COVID case count.
I rectify the issue by obtaining the data for each county in the United States. Using ten times the number of zones in just a fraction of the total area of the globe avoids the aforementioned Flaw of Averages. This level of granularity suggests a far different story.
Significant correlations, both statistical and practical, exist. The downward trend points to the myriad factors beyond population density that affect the number of cases, such as the extent and effectiveness of lockdown measures. Living in a dense area used to carry a much higher risk than living in a sparse area, but that’s changing. From this data alone it’s unclear whether this is due to cities getting better or rural areas getting worse.
My mom enjoys watching the news after long days in her healthcare job. At times I hear the reporters blurt out the number of new cases for the day, but I can never remember yesterday’s numbers. If I talk to my co-workers across the globe, comparison requires a quick Google search for population and a subsequent division. In the United States, at least, a more accurate comparison would also include information on population density. For now, though, I won’t ask too much: I’d be happy if the nightly news would just show the history of cases as a line chart.
You can find a full list of data sources, computations, and decisions in my GitHub repository. Please fork the repository and make modifications, but keep me in the loop. I’d love to see your ideas!