Getting Granular: HBS Students’ Ad Hoc Crowdsourcing Effort Leads to the Largest County-Level Covid-19 Dataset

Project Lead Cray V. Noah (MD/MBA ’22) details the rallying effort that turned a grassroots response to the need for granular Covid-19 policy data into a nationwide endeavor and a federally endorsed research tool.

When Covid-19 cases in the U.S. began to spike in March, the question “What is being done to stop the spread of coronavirus in my area?” resonated across communities. After Hikma Health, a biotech startup incubated at the Harvard i-lab, ran a hackathon aimed at addressing that issue, it was evident that Covid-19 policy data at a granular local level, despite being more representative than pre-existing state- or nation-level data, was virtually nonexistent. The need for this higher-resolution data intensified as the pandemic progressed and two things became clear. First, Covid-19 behaves very differently from county to county. Second, since vaccines and medications have lengthy lead times, policy interventions like shelter-in-place orders and public testing centers are the most effective option, and were initially the only one, for limiting the spread of coronavirus.

In response, I teamed up with fellow HBS MBA students and Hikma Health in an all-hands-on-deck spirit to initiate what would become a nationwide crowdsourcing effort involving more than a hundred volunteers, resulting in the first and largest Covid-19 county policy dataset in the nation. The novel dataset has quickly become a valuable research asset for policymakers, epidemiologists, economists, and other researchers and modelers across the nation. It is accompanied by an interactive map that lets people, regardless of scientific background, interact directly with raw Covid-19 data.

Why Crowdsourcing?

The adage “all politics is local” has proved true during the pandemic as there has been immense variation at the county level on multiple fronts. For one, counties are vastly different when it comes to which Covid-19 policies they choose to adhere to. While many state governors have ordered policy interventions throughout the pandemic, counties ultimately decide if and when they take effect, with some counties having policies in place long before state-level decrees are issued and other counties choosing to follow certain decrees after they are issued. In addition, there is significant variation in how counties report and update their residents on standing Covid-19 policies, ranging from websites, to local news outlets, to social media like Twitter and Facebook.

This variation on multiple fronts makes county-level policy data much more representative of what is actually happening on the ground, but also much harder to collect. At the start of the project, we tried to automate county policy data collection as has been done at the state level, where Covid-19 policy information and governors’ orders are organized on state websites in a way that allows automated collection by AI bots. The many inter-county differences made that impossible. We quickly realized that building out the first county-level Covid-19 policy dataset would require many hours of human deciphering and discernment. The project would have stopped there, but thanks to HBS MBA students rising to the challenge, that’s where it took off.

Project Trajectory

While HBS MBA students across the classes of 2020-2022 have contributed at all stages of the project, they were most important as the key group in the beginning. These initial volunteers researched and collected granular county policy data to build out the dataset and map, which we then used as a prototype and visual aid when pitching the opportunity to garner further support. From there, the HBS network helped the project really catch fire. MBA students across years leveraged connections with research groups, faculty, and other institutions to get valuable feedback from experts and to help disseminate the volunteer opportunity through organizations like MBAs Fight Covid-19 (also started at HBS) and National Student Response Network, as well as through email listservs and Slack channels. As word spread through the tight-knit HBS community, we received tremendous and unexpected levels of inbound interest from willing volunteers.

What started as a small grassroots project with a handful of HBS MBA students and a goal of collecting data on 100 counties has rapidly grown into a nationwide crowdsourcing effort of 130 volunteers, spanning all Harvard University schools as well as graduate programs at more than 20 other academic institutions. The effort has led to a research-caliber dataset covering 1,320 U.S. counties and 154 Native American communities, seed funding to continue the project, and a recent inbound request from the U.S. Department of Health and Human Services (HHS) to incorporate the dataset into the national Covid-19 reporting system. Seeing so many HBS MBA students rally together as volunteers to contribute in non-flashy and often tedious ways, especially in such uncertain, stressful times, has been inspiring. This dataset is unique not only because of the novel data it provides but also because of the thousands of human work hours put in to make it happen.

As the project lead, I was in charge of recruiting and managing volunteers, organizing the data on the back end, and establishing standardized protocols. This became a full-time job and completely shifted my summer plans. Working with my driven classmates from HBS, who saw the bigger picture along with me and voluntarily put in many hours for the cause, helped me understand the character of people who come to HBS and made every minute worth it. In March, the project did not look like it was going to get off the ground. Now, the fact that it is still running and that research groups across the nation are using the dataset to inform policy decisions has been both humbling and validating.

Research Impact

The ultimate goal of this project has been to fill a gap in Covid-19 data. By creating a higher-resolution dataset, we have enabled policymakers, epidemiologists, and economists to perform more representative and applicable research that will inform local policy decisions to come. Initially, many of the experts we reached out to were optimistic but skeptical that the dataset would become big enough for reliable analysis, since we were relying on volunteer crowdsourcing. But once the project caught fire and surpassed 1,000 counties, researchers were impressed by the amount and quality of the data, and inbound interest from around the country really picked up. I was also encouraged to publish the protocol so that others can replicate it if and when rapid, zero-cost data collection is needed later in this pandemic or in future public health crises.

The final product, the first and largest granular Covid-19 policy dataset, contains time series data on if and when seven key policy interventions were implemented or discontinued in 1,320 U.S. counties. It is completely free and open-source, and it is already becoming the preferred policy dataset over more general state-level data. It has been accessed thousands of times and is currently being used by researchers and modelers at government agencies, hospitals, and universities around the country to inform Covid-19 policy next steps. To name a few current examples, local groups at Mass General and Brigham and Women’s Hospitals are coupling our county-level policy dataset with Covid-19 infection and population datasets to analyze which combination of policy interventions works best for specific counties given their immense differences; an economics group at Amherst College is using our dataset to find links between Covid-19 responses and rugged individualism; and Emory University epidemiologists are using the dataset to perform natural experiments quantifying the efficacy and optimal timing of the policy interventions used in the U.S. All of this is important information at the local, state, and federal levels, not only for mitigating the impact of this pandemic but also for preparing plans of attack for future pandemics or other public health emergencies.

Other research groups at Harvard, Stanford, and Columbia Universities have notified us that our dataset is helping them discover correlations between Covid-19 fatalities and different demographic, socio-economic, and political characteristics. Such findings were not possible with only state-level policy analysis, and our dataset can be coupled with other data sources to perform unique and more detailed analyses on factors like demographic disparities, political alignment, or population density in relation to local Covid-19 policies.
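
For readers who want a concrete picture of what “if and when a policy was implemented or discontinued” looks like as data, here is a minimal Python sketch. It is not the project’s actual schema: the PolicyRecord fields, the policy labels, the county names, and the dates are all made up for illustration, and treating the end date as the first day a policy is no longer in effect is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class PolicyRecord:
    """One hypothetical row: a single policy in a single county."""
    county: str
    policy: str
    start: Optional[date]  # None means the policy was never implemented
    end: Optional[date]    # None means no discontinuation date is recorded


def is_active(record: PolicyRecord, day: date) -> bool:
    """True if the policy was in effect on `day` (end date treated as exclusive)."""
    if record.start is None or day < record.start:
        return False
    return record.end is None or day < record.end


def active_policies(records: List[PolicyRecord], county: str, day: date) -> List[str]:
    """Alphabetical list of policies in effect in `county` on `day`."""
    return sorted(
        r.policy for r in records if r.county == county and is_active(r, day)
    )


# Invented example rows, not real data from the dataset.
records = [
    PolicyRecord("Example County", "shelter_in_place", date(2020, 3, 20), date(2020, 5, 15)),
    PolicyRecord("Example County", "public_testing", date(2020, 4, 1), None),
    PolicyRecord("Other County", "shelter_in_place", None, None),
]

print(active_policies(records, "Example County", date(2020, 4, 10)))
print(active_policies(records, "Example County", date(2020, 6, 1)))
```

A per-day lookup like this is the kind of view a researcher might join against daily infection counts for the same county when coupling the policy data with other sources.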

Next Steps

While the extraordinary volunteer crowdsourcing effort initiated by HBS MBA students exceeded our expectations, we need to secure more funding to further scale this unique resource. The Covid-19 research seed funding we recently received is being used to “double-code” the 1,320 counties in the dataset (the gold standard of data validation for crowdsourcing, in which multiple people research the same outcome and reconcile any differences in their responses). Once the dataset is fully double-coded, the hope is that the U.S. Department of Health and Human Services and other outlets will endorse and post this valuable county-level dataset and map, so that it gets into the hands of as many people and research groups as possible.

If the project were to stop there, it would be a tremendous success and a valuable research tool for years to come, representing the trajectory of U.S. county policies during the Covid-19 pandemic from March through August of 2020. However, further monetary and in-kind support from the government or any other source would take this project to the next level. Such support would allow us to keep the dataset updated on a weekly basis, collect data on more policies, expand beyond 1,320 counties, and hire software engineers and staff to create a standalone website that serves as a hub for all local-level Covid-19 data and real-time analysis. Given the project’s track record, and given that I will be plugging further into the HBS community as I start the MBA curriculum this fall, there’s no telling where it could go.
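
The double-coding step described above can be pictured with a short Python sketch. This is only an illustration of the general idea (two people independently record the same outcome and disagreements are flagged for reconciliation), not the project’s actual protocol. The function name compare_codings, the (county, policy) keys, and the dates are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]                 # hypothetical key: (county, policy)
Coding = Dict[Key, Optional[str]]     # value: ISO start date, or None if not found


def compare_codings(coder_a: Coding, coder_b: Coding):
    """Split entries into those both coders agree on and those needing review.

    An entry missing from one coder's sheet is treated as None for that coder.
    """
    agreed: Dict[Key, Optional[str]] = {}
    conflicts: Dict[Key, Tuple[Optional[str], Optional[str]]] = {}
    for key in sorted(set(coder_a) | set(coder_b)):
        a, b = coder_a.get(key), coder_b.get(key)
        if a == b:
            agreed[key] = a
        else:
            conflicts[key] = (a, b)   # left for a human to reconcile
    return agreed, conflicts


# Invented entries for illustration only.
coder_a = {
    ("Example County", "shelter_in_place"): "2020-03-20",
    ("Example County", "public_testing"): "2020-04-01",
}
coder_b = {
    ("Example County", "shelter_in_place"): "2020-03-20",
    ("Example County", "public_testing"): "2020-04-03",
}

agreed, conflicts = compare_codings(coder_a, coder_b)
print("agreed:", agreed)
print("needs reconciliation:", conflicts)
```

The sketch deliberately stops at flagging conflicts rather than picking a winner, since resolving a disagreement means going back to the source a county actually published.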

Contribute/Contact

If you would like to contribute to this project in any way, whether through funding, data collection, spreading word of the opportunity, using or sharing the dataset for research, or using the interactive map to explore raw Covid-19 data, please email covidpolicies2020@gmail.com.

Links to the completely free and open-source dataset and associated interactive map:

Dataset https://github.com/hikmahealth/covid19countymap

Map https://www.hikmahealth.org/map


Cray V. Noah (MD/MBA ’22) is an engineer and a doctor-in-training dedicated to innovating and extending the reach of medical technology. As a fourth-year student in Harvard’s MD/MBA dual degree program, he has started his RC year at HBS this Fall. Noah currently works at the nexus of biomedicine and business as lead engineer on multiple patented medical devices originating from his time at Georgia Tech, a transplant surgery researcher at Mass General Hospital, Managing Director of Data Projects at digital health startup Hikma Health Inc., and active Board and Founding Member of nonprofit Medicine in Motion Inc. A native Texan, Noah has transitioned from football to tennis, triathlons and sailing since moving northeast.