A little-known AI method can train on your health data without threatening your privacy

Machine learning has great potential to transform disease diagnosis and detection, but it’s been held back by patients’ reluctance to give up access to sensitive information.

Karen Hao

March 11, 2019


In 2017, Google quietly published a blog post about a new approach to machine learning. Unlike the standard method, which requires the data to be centralized in one place, the new one could learn from a series of data sources distributed across multiple devices. The invention allowed Google to train its predictive text model on all the messages sent and received by Android users—without ever actually reading them or removing them from their phones.

Despite its cleverness, federated learning, as the researchers called it, gained little traction within the AI community at the time. Now that is poised to change as it finds application in a completely new area: its privacy-first approach could very well be the answer to the greatest obstacle facing AI adoption in health care today.

“There is a false dichotomy between the privacy of patient data and the utility of the data to society,” says Ramesh Raskar, an MIT associate professor of computer science whose research focuses on AI in health. “People don’t realize the sand is shifting under their feet and that we can now in fact achieve privacy and utility at the same time.”

Over the last decade, the dramatic rise of deep learning has led to stunning transformations in dozens of industries. It has powered our pursuit of self-driving cars, fundamentally changed the way we interact with our devices, and reinvented our approach to cybersecurity. In health care, however, despite many studies showing its promise for detecting and diagnosing diseases, progress in using deep learning to help real patients has been tantalizingly slow.

Current state-of-the-art algorithms require immense amounts of data to learn—in most cases, the more data the better. Hospitals and research institutions need to combine their data reserves if they want a pool of data that is large and diverse enough to be useful. But especially in the US and the UK, the idea of centralizing reams of sensitive medical information in the hands of tech companies has repeatedly—and unsurprisingly—proved intensely unpopular.

As a result, research on diagnostic uses of AI has stayed narrow in scope and applicability. You can’t deploy a breast cancer detection model around the world when it’s only been trained on a few thousand patients from the same hospital.

All this could change with federated learning. The technique can train a model using data stored at multiple different hospitals without that data ever leaving a hospital’s premises or touching a tech company’s servers. It does this by first training separate models at each hospital with the local data available and then sending those models to a central server to be combined into a master model. As each hospital acquires more data over time, it can download the latest master model, update it with the new data, and send it back to the central server. Throughout the process, raw data is never exchanged—only the models, which cannot be reverse-engineered to reveal that data.
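The train-locally, combine-centrally loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system Google or any hospital actually runs: the "model" is a toy straight-line fit, the three hospitals and their data are synthetic, the combining step is a sample-weighted average of the local models (one common choice, often called federated averaging), and every name here (`local_update`, `federated_average`, `make_hospital_data`) is hypothetical.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Train a copy of the master model on one hospital's own data.

    Only the updated weights and the sample count are returned;
    the raw (x, y) records never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error for the line y = w*x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n

def federated_average(updates):
    """Server side: combine local models, weighted by sample count."""
    total = sum(n for _, n in updates)
    w = sum(model[0] * n for model, n in updates) / total
    b = sum(model[1] * n for model, n in updates) / total
    return (w, b)

def make_hospital_data(rng, n, true_w=2.0, true_b=1.0):
    """Synthetic stand-in for one hospital's records (an assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, true_w * x + true_b + rng.gauss(0, 0.05)))
    return data

rng = random.Random(0)
hospitals = [make_hospital_data(rng, n) for n in (30, 50, 80)]

master_model = (0.0, 0.0)
for _ in range(50):
    # Each hospital downloads the master model and trains it locally.
    updates = [local_update(master_model, data) for data in hospitals]
    # The server sees only the models, never the records.
    master_model = federated_average(updates)

print(master_model)
```

In this toy setup the master model should end up close to the slope and intercept used to generate the synthetic data (2.0 and 1.0), even though the server only ever handles model weights.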

There are some challenges to federated learning. For one, combining separate models risks creating a master model that’s actually worse than each of its parts. Researchers are now working on refining existing techniques to make sure that doesn’t happen, says Raskar. For another, federated learning requires every hospital to have the infrastructure and personnel capabilities for training machine-learning models. There’s also friction in standardizing data collection across all hospitals. But these challenges aren’t insurmountable, says Raskar: “More work needs to be done, but it’s mostly Band-Aid work.”

In fact, other privacy-first distributed learning techniques have since cropped up in response to these challenges. Raskar and his students, for example, recently invented one called split learning. As in federated learning, each hospital starts by training its own model, but it trains that model only halfway. The half-baked models are then sent to the central server, which combines them and finishes the training. The main benefit is that this would alleviate some of the computational burden on the hospitals. The technique is still mainly a proof of concept, but in early testing, Raskar’s research team showed that it created a master model nearly as accurate as one trained on a centralized pool of data.
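One way to picture "training halfway" is to cut a neural network at a layer: the hospital runs the first part on its own data, and the server runs the rest. The sketch below assumes that reading and is not taken from the article or from Raskar's code. It uses a single hospital, a two-layer toy network with one hidden unit, synthetic data, and a simple configuration in which the labels are shared with the server. The class names `HospitalHalf` and `ServerHalf` are hypothetical.

```python
import math

class HospitalHalf:
    """First layer of the network; runs where the raw data lives."""
    def __init__(self):
        self.w, self.b = 0.5, 0.0

    def forward(self, x):
        self.x = x
        self.h = math.tanh(self.w * x + self.b)
        return self.h  # only this activation is sent to the server

    def backward(self, grad_h, lr):
        # Finish backpropagation locally, using the gradient the server returned.
        grad_z = grad_h * (1 - self.h ** 2)
        self.w -= lr * grad_z * self.x
        self.b -= lr * grad_z

class ServerHalf:
    """Second layer; sees activations and labels, never the raw inputs."""
    def __init__(self):
        self.w, self.b = 0.5, 0.0

    def predict(self, h):
        return self.w * h + self.b

    def train_step(self, h, y, lr):
        err = self.predict(h) - y
        grad_h = 2 * err * self.w  # gradient to send back to the hospital
        self.w -= lr * 2 * err * h
        self.b -= lr * 2 * err
        return grad_h

def mean_loss(hospital, server, xs, ys):
    """Mean squared error of the full two-part network."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += (server.predict(hospital.forward(x)) - y) ** 2
    return total / len(xs)

# Synthetic data that stays on the hospital side (an assumption).
xs = [i / 10 - 1 for i in range(21)]
ys = [1.5 * math.tanh(0.8 * x + 0.2) - 0.3 for x in xs]

hospital, server = HospitalHalf(), ServerHalf()
initial_loss = mean_loss(hospital, server, xs, ys)

for _ in range(200):
    for x, y in zip(xs, ys):
        h = hospital.forward(x)                   # hospital: first half of forward pass
        grad_h = server.train_step(h, y, lr=0.05)  # server: rest of the pass, own update
        hospital.backward(grad_h, lr=0.05)        # hospital: update its own layer

final_loss = mean_loss(hospital, server, xs, ys)
print(initial_loss, final_loss)
```

The design point is the division of labor: in a realistically sized network the hospital would compute only the early layers, so most of the training work would fall on the server.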

A handful of companies, including IBM Research, are now working on using federated learning to advance real-world AI applications for health care. Owkin, a Paris-based startup backed by Google Ventures, is also using it to predict patients’ resistance to different treatments and drugs, as well as their survival rates with certain diseases. The company is working with several cancer research centers in the US and Europe to utilize their data for its models. The collaborations have already resulted in a forthcoming research paper, the founders say, on a new model that predicts survival odds for a rare form of cancer on the basis of a patient’s pathology images. The paper will take a major step toward validating the benefits of this technique in a real-world setting.

“I’m really excited,” says Owkin cofounder Thomas Clozel, a clinical research doctor. “The biggest barrier in oncology today is knowledge. It’s really amazing that we now have the power to extract that knowledge and make medical breakthrough discoveries.”

Raskar believes the applications of distributed learning could also extend far beyond health care to any industry where people don’t want to share their data. “In distributed, trustless environments, this is going to be very, very powerful in the future,” he says.

This story originally appeared in our AI newsletter The Algorithm.
