Across the nation, federal, state, and local privacy laws overlap and sometimes contradict one another. Medical records, meanwhile, are messy, fragmented and intensely siloed by the institutions that own them—both for privacy reasons, and because selling de-identified medical data is incredibly profitable.

But accessing data trapped in these silos is the only way to answer questions about covid. That’s why so much vital research has been done abroad, in countries with national healthcare systems, despite the US having a huge number of both covid patients and research institutions. Some of the strongest data on risk factors for covid mortality and features of long covid have come from the UK, for example, where public health researchers have access to data from 56 million NHS patients’ medical records.

At the beginning of the pandemic, a group of researchers funded by the US National Institutes of Health, or NIH, realized that many questions about covid-19 would be impossible to answer without breaking down barriers to data sharing. So they developed a framework for combining actual patient records across institutions in a way that could be both private and useful.

The result is the National COVID Cohort Collaborative (N3C), which collects medical records from millions of patients around the country, cleans them, and then grants access to groups studying everything from when to use a ventilator to how covid affects women’s periods.

“It’s just shocking that we had no harmonized, aggregate health data for research in the face of a pandemic,” says Melissa Haendel, Professor of Medical Informatics at the Oregon Health & Science University and one of the co-leads of N3C. “We never would have gotten everyone to give us this degree of data outside the context of a pandemic, but now that we’ve done it, it’s a demonstration that clinical data can be harmonized and shared broadly in a secure way, and a transparent way.”

The database is now one of the largest collections of covid records in the world, with 6.3 million patient records from 56 institutions and counting, including records from 2.1 million patients with the virus. Most records go back to 2018, and contributing organizations have pledged to keep updating them for five years. That makes N3C not just one of the most useful resources for studying the disease today, but one of the most promising ways to study long covid.

Institutions sending records, in bulk, to a centralized federal government database is an anomaly in American healthcare. Put to good use, the database has the potential to answer detailed questions long after the pandemic. And it may even serve as a proof of concept for similar efforts in the future.

Open-source data

To contribute information to the database, participating providers first pick two groups of patients: people who have tested positive for covid, and those who will serve as a control group. They then strip out everything that makes the data personally identifiable, except zip code and dates of service, and transmit it securely to N3C. There, technicians clean the data—not always an easy task—and put it into the database.
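For readers who think in code, the contribution step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`mrn`, `covid_positive`, `zip_code`, `service_date`) and the helper functions are hypothetical, and nothing here reflects N3C's actual schema or tooling, which the article does not describe.

```python
# Hypothetical field names throughout; N3C's real schema is not given in the article.
IDENTIFYING_FIELDS = {"name", "ssn", "address", "phone", "mrn"}


def deidentify(record):
    """Drop direct identifiers; zip code and dates of service are retained."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def build_submission(records, control_ids):
    """Keep covid-positive patients and the chosen controls, then de-identify each record."""
    cohort = [
        r for r in records
        if r["covid_positive"] or r["mrn"] in control_ids
    ]
    return [deidentify(r) for r in cohort]


# Made-up example records, for illustration only.
records = [
    {"mrn": "A1", "name": "Pat Doe", "zip_code": "97239",
     "service_date": "2020-04-02", "covid_positive": True},
    {"mrn": "B2", "name": "Sam Roe", "zip_code": "97201",
     "service_date": "2019-11-15", "covid_positive": False},
    {"mrn": "C3", "name": "Lee Poe", "zip_code": "97005",
     "service_date": "2020-01-20", "covid_positive": False},
]

# Patient B2 is chosen as a control; C3 is neither positive nor a control.
submission = build_submission(records, control_ids={"B2"})
```

In this sketch the cohort is selected before identifiers are removed, because the hypothetical record number used to pick controls is itself one of the fields that gets stripped.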

Anyone can submit a research proposal through N3C’s dashboard, whether or not they’re affiliated with a submitting institution. Even citizen scientists can request access to an anonymized version of the dataset.