Big Data Mining

A Q&A with one of the leading inventors of tools for medical data analytics.

Over the past decade, health-care providers have spent tens of billions of dollars to digitize their patients’ medical records. In theory this should be providing researchers with a treasure trove of data to dig through for evidence of the effectiveness and efficiency of care. In practice, it’s more complicated.

Data from these records can be hard to access and difficult to make sense of once it is in hand. Patient privacy issues and data security are of increasing concern and have yet to be fully addressed.

Isaac Kohane, co-director of the Center for Biomedical Informatics at Harvard Medical School, has spent the last 20 years working to pull meaning out of large sets of health data. A pediatrician with a PhD in computer science, Kohane mined medical data to discover the risk of heart attack for patients on one widely prescribed diabetes medicine, Avandia. After his study the drug was pulled off the market. His other research has identified early warning signs of domestic abuse and revealed the variations and patterns among patients with disorders such as autism. He spoke with senior editor Nanette Byrnes.

Has this multibillion-dollar investment in electronic health records led to better health care, or at least a better understanding of the quality of care?

You can’t have accountable care if you can’t count. But you’d be dismayed. If you ask any large health-care system, “How many patients do you have with this characteristic? How many patients of this kind did your doctors see? What was their average length of stay?” they will not know.

I do not believe it’s overly cynical to note that many electronic-health-record vendors have touted the ability to bill more effectively for care using electronic records than paper records. [Records that doctors submit to insurance companies] for reimbursement are obviously biased to maximize the income of the health-care system. They may not necessarily reflect the on-the-ground biological or clinical truth.

One oft-cited goal of medical analytics is to combine a patient’s health records with information from his or her genome to create a very precise kind of personalized medical care. But that also seems far off.

In addition to all the challenges of genomic data by virtue of its volume and complexity, no major electronic-health-record vendor supports it. A lot of electronic health records, if you look under the hood, are fairly antiquated. Even though they have a modern skin, they are really state-of-the-art 1980s technology, so integrating them with all the existing genomic tools is a very high bar. Even perhaps more important sometimes than the genome is knowing family history and knowing it in a structured way. But that is not done in most electronic health records, either. The bulk of our health-care data comes from [insurance] claims data and electronic health records, period. And maybe a little bit of public health data.

You and your colleagues have created two platforms with the idea that developers would write apps that could unlock what is in electronic health records.

I do not believe the answer is to tear down all these [electronic-health-record] dinosaurs. Work has gone into them, a lot of thought has gone into them, and you don’t want to rebuild all the back-end stuff. The apps give you modern functionality.

What’s an example?

A detailed family history is on average the most informative [information] for understanding inherited disease risk. Yet very few electronic health records provide the capability to easily enter a family history and link it to the broader genealogy of the family. There are several highly successful Web apps that are low-cost and yet allow entry of highly detailed family history—and by virtue of their market success allow linking of a small family’s history to a much larger genealogy. With a platform like ours you can adapt these modern Web apps to provide a legacy electronic health record with a state-of-the-art family history record.

Even with the technology to mine these records, you say doing that accurately can be tricky.

I am concerned that it’s all too easy to see the data and say, “I’ve been doing big-data analysis for Target and now I can do it for medicine.” That turns out not to be true. You really have to know something about medicine. If statistics lie, then big data can lie in a very, very big way.

When you are looking for adverse events in drugs given for diabetes, for example, it’s pretty tricky if one of the adverse events you are looking for is heart attacks, because heart attacks are also a result of poor diabetes care—the same reason for which the drug is being given. So consequently, if you just willy-nilly said “Just give me all the drugs with a high rate of heart attack,” of course all the diabetes drugs would light up. Instead what we did was say, “Let’s compare the different drugs that are used in the same way and belong to the same class of drug and see if we can see different rates of heart attack if we control for all the other aspects.” And sure enough, we found one such drug. It was called Avandia, and compared to another similar drug, it had a much higher heart attack rate.
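
To make the distinction concrete, here is a minimal Python sketch of the within-class comparison Kohane describes. It is not his team's actual method or data: the drug names, class labels, and patient counts are all invented for illustration, and the adjustment for "all the other aspects" (age, disease severity, and so on) is left out entirely.

```python
from collections import defaultdict

# Hypothetical records: (drug, drug_class, had_heart_attack).
# Every name and count below is invented for illustration.
RECORDS = (
    [("drug_a", "diabetes_class_x", True)] * 30
    + [("drug_a", "diabetes_class_x", False)] * 970
    + [("drug_b", "diabetes_class_x", True)] * 15
    + [("drug_b", "diabetes_class_x", False)] * 985
    + [("drug_c", "allergy_class_y", True)] * 5
    + [("drug_c", "allergy_class_y", False)] * 995
)


def heart_attack_rates(records):
    """Naive view: heart attack rate per drug, ignoring why it was prescribed."""
    counts = defaultdict(lambda: [0, 0])  # drug -> [events, patients]
    for drug, _drug_class, had_event in records:
        counts[drug][0] += int(had_event)
        counts[drug][1] += 1
    return {drug: events / patients for drug, (events, patients) in counts.items()}


def within_class_ratios(records, comparator):
    """Rate of each drug relative to a comparator drug in the same class."""
    drug_class = {drug: cls for drug, cls, _ in records}
    rates = heart_attack_rates(records)
    target_class = drug_class[comparator]
    return {
        drug: rates[drug] / rates[comparator]
        for drug, cls in drug_class.items()
        if cls == target_class and drug != comparator
    }


if __name__ == "__main__":
    # Naive ranking: both diabetes drugs "light up" against the unrelated drug.
    print(heart_attack_rates(RECORDS))
    # Within-class comparison: only drugs used the same way are compared.
    print(within_class_ratios(RECORDS, comparator="drug_b"))
```

In the naive ranking, both invented diabetes drugs show higher rates than the unrelated drug, which says more about the patients than the pills. Only the within-class ratio isolates a difference between two drugs given for the same reason. A real analysis would also need to adjust for patient characteristics and check statistical significance.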

There is a lot of concern that compiling databases of health records could result in personal information becoming public. Does that worry you?

The more I know about someone, the more I can do useful things for them, and the more I know about them the more I can discover. And the more you blind me to things, the less useful I’ll be. The only real protection is that the people who have the authorized use of the data have to understand what is the right code of conduct.
