It took a pandemic, but the US finally has (some) centralized medical data
Covid exposed the fragmented reality of US health records. Now an effort to bring together data from millions of patients is starting to show results.
- 6.3 million de-identified records are currently in NIH’s N3C database
- It’s become one of the largest collections of covid patient records in the world
- The scheme avoids the data silos and privacy issues that riddle the US health-care system
Throughout the pandemic, there has been serious tension between what the public wants to know and what scientists have been able to say for certain.
Scientists have been able to learn more about covid, faster, than about any other disease in history—but at the same time, the public has been shocked when doctors can’t answer seemingly basic questions: What are the symptoms of covid-19? How does it spread? Who’s most susceptible? What’s the best way to treat it?
Nowhere has this conflict been more clear than in the US, which spends nearly a fifth of its gross domestic product on health care but achieves worse outcomes than any other wealthy country. Finding the answers has been complicated not just because the science is hard, but because American health care is built on a patchwork of incompatible, archaic systems.
Across the nation, federal, state, and local privacy laws overlap and sometimes contradict one another. Medical records, meanwhile, are messy, fragmented, and intensely siloed by the institutions that own them—both for privacy reasons and because selling de-identified medical data is incredibly profitable.
But accessing data trapped in these silos is the only way to answer questions about covid. That’s why so much vital research has been done abroad, in countries with national health-care systems, even though the US has a huge number of both covid patients and research institutions. Some of the strongest data on risk factors for covid mortality and on the features of long covid has come from the UK, for example. There, public health researchers have access to data from 56 million NHS patients’ medical records.
At the beginning of the pandemic, a group of researchers funded by the US National Institutes of Health, or NIH, realized that many questions about covid-19 would be impossible to answer without breaking down barriers to data sharing. So they developed a framework for combining actual patient records from different institutions in a way that could be both private and useful.
The result is the National COVID Cohort Collaborative (N3C), which collects medical records from millions of patients around the country, cleans them, and then grants access to groups studying everything from when to use a ventilator to how covid affects menstrual cycles.
“It’s just shocking that we had no harmonized, aggregate health data for research in the face of a pandemic,” says Melissa Haendel, a professor of research informatics at the University of Colorado Anschutz Medical Campus and one of the co-leads of N3C. “We never would have gotten everyone to give us this degree of data outside the context of a pandemic, but now that we’ve done it, it’s a demonstration that clinical data can be harmonized and shared broadly in a secure way, and a transparent way.”
The database is now one of the largest collections of covid records in the world, with 6.3 million patient records from 56 institutions and counting, including records from 2.1 million patients with the virus. Most records go back to 2018, and contributing organizations have pledged to keep updating them for five years. That makes N3C not just one of the most useful resources for studying the disease today, but one of the most promising ways to study long covid.
A system in which institutions send records, in bulk, to a centralized federal government database is an anomaly in American health care. Put to good use, it has the potential to answer detailed questions long after the pandemic. And it may even serve as proof of concept for similar efforts in the future.
To contribute information to the database, participating providers first pick two groups of patients: people who have tested positive for covid, and others who will serve as a control group. They then strip out everything that makes the data personally identifiable, except zip code and dates of service, and transmit it securely to N3C. There, technicians clean the data—not always an easy task—and enter it into the database.
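The stripping step described above can be pictured with a minimal Python sketch. It is an illustration only, not N3C's actual pipeline: the field names and the list of identifiers are hypothetical, and real de-identification involves far more than dropping a few columns.

```python
# Hypothetical set of directly identifying fields; a real list would be
# much longer and governed by the applicable privacy rules.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "street_address", "record_number"}


def strip_identifiers(record):
    """Return a copy of the record without directly identifying fields.

    Zip code and dates of service are deliberately kept, as the article
    describes.
    """
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }


# Made-up example record, for illustration only.
raw_record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip_code": "80045",
    "service_date": "2020-11-03",
    "covid_test": "positive",
}

shareable = strip_identifiers(raw_record)
print(sorted(shareable))
```

The original record is left untouched; only the filtered copy would be transmitted.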
Anyone can submit a research proposal through N3C’s dashboard, whether affiliated with a submitting institution or not. Even citizen scientists can request access to an anonymized version of the data set.
An NIH committee reviews each proposal and decides which version of the data researchers will be able to access. There are several tiers of information: a limited data set, a second level containing real records with zip codes and dates obscured, and a third made of computer-generated “synthetic” records, which attempt to keep the same attributes as the real records without containing any real patient data. Everyone has to go through data security training before gaining access.
So far 215 research projects have been approved, including studies to track outcomes for patients who have received different covid vaccines and examine the complication rates of elective surgeries in non-covid patients during the pandemic. The first publication from the collaborative was an analysis of mortality risk factors in cancer patients who contracted SARS-CoV-2, and several preprints have been released on topics including covid outcomes in liver disease patients and people with HIV.
More accountability, better science
Clean, accurate data is vital to such studies, but it’s been tough to come by in the chaos of the pandemic. Last June, two major journals, the BMJ and The Lancet, retracted papers based on “data” from Surgisphere, a little-known medical data company with a handful of employees. It claimed to have access to real-time medical records from nearly 100,000 covid patients in 700 hospitals around the world. In some cases the numbers represented more patients than had actually been diagnosed in a given country.
Before being retracted, the papers led to decisions to halt clinical trials and alter medical practices. But when researchers became suspicious—particularly given that even a single agreement on medical data transfer takes enormous time and labor—the company refused to let anyone audit the data. In fact, there’s no proof the database ever existed.
N3C, on the other hand, is auditable by, and accountable to, thousands of researchers at hundreds of participating institutions, with a strong focus on transparency and reproducibility. Everything users do through the interface, which uses Palantir’s GovCloud platform, is carefully preserved, so anyone with access can retrace their steps.
“This isn’t rocket science, and it isn’t really new. It’s just hard work. It’s tedious, it has to be done carefully, and we have to validate every step,” says Christopher Chute, a professor of medicine at Johns Hopkins who also co-leads N3C. “The worst thing we could do is methodically transform data into garbage that would give us wrong answers.”
Haendel points out that these efforts haven’t come easy. “The diversity in expertise that it took to make this happen—the perseverance, dedication, and, frankly, brute force—is just unprecedented,” she says.
That brute force has come from many different fields beyond just medicine.
“Having everyone on board from all aspects of science really helped. During covid people were much more willing to collaborate,” says Mary Boland, a professor of informatics at the University of Pennsylvania. “You could have engineers, you could have computer scientists, physicists—all these people who might not normally participate in public health research.”
Boland is part of a group using the N3C data to look at whether covid increases irregular bleeding in women with polycystic ovarian syndrome. Ordinarily, most researchers have to use insurance claims data to get a large enough database for population-level analyses, she says.
Claims data can answer some questions about how well drugs work in the real world, for instance. But those databases are missing huge amounts of information, including lab results, the symptoms people are reporting, and even data on whether patients survive or die.
Collecting and cleaning
Outside of insurance claims databases, most health data collaboratives in the US use a federated model. Participants in these studies all agree to convert their own data sets to a common format and then run queries from the collective, such as the proportion of serious covid cases by age group. Several international covid research collectives, including the Observational Health Data Sciences and Informatics program (OHDSI, pronounced “Odyssey”), operate this way, avoiding legal and political problems with cross-border patient data.
OHDSI, which was founded in 2014, has researchers from 30 countries and holds records for 600 million patients.
“That allows each institution to keep their data behind their own firewalls, with their own data protections in place. It doesn’t require any patient data to move back and forth,” says Boland. “That’s comforting for a lot of places, especially with all the hacking that’s been going on lately.”
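A rough Python sketch of how such a federated query might work, using the serious-cases-by-age-group example: each site computes counts locally and shares only those aggregates. The record layout, function names, and sample data here are all hypothetical, and this is not OHDSI's actual tooling.

```python
from collections import Counter


def local_counts(records):
    """Runs inside one institution; returns only aggregate counts per age group."""
    totals, serious = Counter(), Counter()
    for record in records:
        totals[record["age_group"]] += 1
        if record["serious"]:
            serious[record["age_group"]] += 1
    return totals, serious


def combine(site_results):
    """Runs at the coordinator; sees counts, never patient-level rows."""
    totals, serious = Counter(), Counter()
    for site_totals, site_serious in site_results:
        totals.update(site_totals)    # Counter.update adds counts together
        serious.update(site_serious)
    return {group: serious[group] / totals[group] for group in totals}


# Made-up records held at two separate sites.
site_a = [
    {"age_group": "65+", "serious": True},
    {"age_group": "65+", "serious": False},
    {"age_group": "18-64", "serious": False},
]
site_b = [
    {"age_group": "65+", "serious": True},
    {"age_group": "18-64", "serious": True},
    {"age_group": "18-64", "serious": False},
    {"age_group": "18-64", "serious": False},
]

proportions = combine([local_counts(site_a), local_counts(site_b)])
print(proportions)
```

Only the two counters per site cross the firewall, which is the property Boland describes. The sketch assumes every site has already mapped its data to the same age groups and the same definition of "serious."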
But relying on each institution to prepare its own data for such a system carries a lot of risks.
“Getting data into a common data format is the biggest challenge, because even medication names—you’d think that would be standardized across the US, but it’s really not,” says Boland. “Pharmacies will often have their generic drug, and it might have slightly different ingredients because of patent laws. Each of those is its own drug name.”
N3C, on the other hand, asks all participants to send their raw, messy records to one place and let the central body clean them up and standardize them. While there are many obvious benefits, there are significant legal and social obstacles to participating this way, both in America and internationally; many institutions, for instance, can’t contribute to N3C because of privacy laws in their states.
It’s also technologically challenging. Combining even two sets of electronic medical records is extremely difficult and labor intensive; the quality of data is often low, and there’s little standardization. In multi-site health-care organizations, as many as 1 in 5 medical records are duplicate files, mostly as a result of data entry screw-ups during appointments or check-ins, according to a 2018 Pew paper.
Those defending federated models often claim they do their own quality control behind their firewall. But N3C researchers were shocked to find out just how messy the data was.
“There was a certain amount of skepticism from sites, like, ‘We don’t really need this kind of data quality framework—we already do that at our own sites confidentially, behind our firewall. We don’t need your stinking harmonization tools,’” says Haendel. “But we learned those quality measures are insufficient when you look at data in aggregate.”
Some of the data quality problems have bordered on the absurd.
“In some cases, organizations have failed to put in units of measure. So there was a weight, but there was no unit, like we were just supposed to know,” says Chute. But having such a huge number of records gave them an advantage, and let them save many data points that otherwise would have been thrown out.
“We were able to look at the distributions of data for which we did have units, and see where the mystery data fit,” he says. “You can just eyeball it—oh, this is obviously pounds or kilograms.”
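The idea Chute describes can be sketched in a few lines of Python: compare a batch of unit-less weights with the distributions of values whose units are known, and pick the closest match. This is an illustrative sketch, not N3C's actual method; the function name and all the numbers are made up, and it assumes every value is positive.

```python
from math import log
from statistics import median


def infer_unit(unlabelled, labelled_by_unit):
    """Guess the unit of a batch of unit-less values.

    Picks the labelled distribution whose median is closest to the batch's
    median on a log scale. Assumes all values are positive.
    """
    target = log(median(unlabelled))
    return min(
        labelled_by_unit,
        key=lambda unit: abs(target - log(median(labelled_by_unit[unit]))),
    )


# Made-up adult weights for which the unit was recorded.
labelled = {
    "kg": [54.0, 61.5, 70.2, 82.0, 95.3],
    "lb": [119.0, 135.6, 154.8, 180.8, 210.1],
}

# Made-up weights from a site that omitted the unit.
mystery_site = [128.0, 150.5, 171.0, 199.0]

print(infer_unit(mystery_site, labelled))
```

Working on a whole batch rather than a single value matters here: one reading of 100 could plausibly be either unit, but the median of many readings from the same site is much less ambiguous.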
A big fish in a much bigger ocean
As extensive as it is, the N3C database is dwarfed by the scale of data collected and maintained elsewhere in the US health-care system, from government agencies to hospitals, testing labs, insurers, and others. The Department of Health and Human Services tracks more than 2,000 health-related data sets from federal, state, and local agencies alone.
The usefulness of each is limited by siloing: it’s essentially impossible for researchers working on their own to connect Medicare claims, records from vaccine registries, states’ racial and ethnicity data for vaccinations, or databases on covid-19 variants sequenced from patient samples around the country. Indeed, turning raw records into useful information is so challenging it’s become a thriving private industry: data brokers buy de-identified records in bulk, analyze correlations between variables, and sell their analyses—or the data itself—to researchers and governments.
“We’re willing to give all our data to a commercial entity and let them sell it back to us, but we’re unwilling to pay for the most basic public health infrastructure,” says Haendel. “This volunteer effort in the face of a pandemic is amazing, but it’s not a sustainable long-term solution for dealing with future pandemics, or just health care in general.”
The N3C approach steers away from some of those problems, but there are significant holes in its data, notably information on vaccinations. Most vaccines are being administered at community sites, while the collaborative’s records are from primary-care visits and hospitalizations, which means that just 245,000 Pfizer vaccines and 104,000 Moderna vaccines have been captured in the records. A health-care analytics company is building a tool to securely integrate patient records from multiple sources, but it won’t be available for at least a few months.
Even with those gaps, though, N3C’s enormous database offers one of the best resources for researchers looking to answer the many unsolved questions about covid.
“That’s kind of where we’re stuck now,” says Haendel. “We really need domain experts in all different aspects of clinical care, and the science behind them, to help us find all the needles in haystacks.”
Editor's note: An earlier version of this story incorrectly identified the committee that reviews N3C data use proposals. It is part of the NIH, not Johns Hopkins.
This story is part of the Pandemic Technology Project, supported by the Rockefeller Foundation.