In the last few years, research has shown that deep learning can match expert-level performance in medical imaging tasks like early cancer detection and eye disease diagnosis. But there’s also cause for caution. Other research has shown that deep learning has a tendency to perpetuate discrimination. With a health-care system already riddled with disparities, sloppy applications of deep learning could make those disparities worse.
Now a new paper published in Nature Medicine is proposing a way to develop medical algorithms that might help reverse, rather than exacerbate, existing inequality. The key, says Ziad Obermeyer, an associate professor at UC Berkeley who oversaw the research, is to stop training algorithms to match human expert performance.
The paper looks at a specific clinical example of the disparities that exist in the treatment of knee osteoarthritis, an ailment which causes chronic pain. Assessing the severity of that pain helps doctors prescribe the right treatment, including physical therapy, medication, or surgery. This is traditionally done by a radiologist reviewing an x-ray of the knee and scoring the patient’s pain on the Kellgren–Lawrence grade (KLG), which calculates pain levels based on the presence of different radiographic features, like the degree of missing cartilage or structural damage.
But data collected by the National Institutes of Health found that doctors using this method systematically score Black patients’ pain as far less severe than what they say they’re experiencing. Patients self-report their pain levels using a survey that asks how much it hurts to do various things, such as fully straightening their knee. But these self-reported pain levels are ignored in favor of the radiologist’s KLG score when prescribing treatment. In other words, Black patients who show the same amount of missing cartilage as white patients self-report higher levels of pain.
This has consistently puzzled medical experts. One hypothesis is that Black patients could be reporting higher levels of pain in order to get doctors to take them more seriously. But there’s an alternative explanation. The KLG methodology itself could be biased. It was developed several decades ago with white British populations. Some medical experts argue that the list of radiographic markers it tells clinicians to look for may not include all the possible physical sources of pain within a more diverse population. Put another way, there may be radiographic indicators of pain that appear more commonly in Black people that simply aren’t part of the KLG rubric.
To test this possibility, the researchers trained a deep-learning model to predict patients’ self-reported pain level from their knee x-ray. If the resulting model had poor accuracy, this would suggest that self-reported pain is largely arbitrary. But if the model had high accuracy, this would provide evidence that self-reported pain is in fact correlated with radiographic markers in the x-ray.
After running several experiments, including some designed to discount any confounding factors, the researchers found that the model was much more accurate than KLG at predicting self-reported pain levels for both white and Black patients, but especially for Black patients. It reduced the racial disparity at each pain level by nearly half.
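For readers who want a concrete sense of what it can mean for a severity score to "reduce" a disparity, here is a minimal sketch in Python. It is not the paper's code, data, or exact method. It assumes one common approach: measure the gap in self-reported pain between two groups, then measure it again after controlling for a severity score, and see how much of the gap the score accounts for. Every number below is synthetic and invented for illustration, and the names `rubric_score` and `model_score` are hypothetical stand-ins for a KLG-style grade and an algorithm's prediction.

```python
import numpy as np

def adjusted_gap(pain, group, severity=None):
    """Least-squares coefficient on `group` (0 or 1) when regressing pain
    on group, optionally controlling for a severity score."""
    cols = [np.ones_like(pain), group.astype(float)]
    if severity is not None:
        cols.append(severity)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, pain, rcond=None)
    return coef[1]

def share_of_gap_explained(pain, group, severity):
    """Fraction of the raw between-group pain gap that disappears once
    the severity score is controlled for."""
    raw = adjusted_gap(pain, group)
    adjusted = adjusted_gap(pain, group, severity)
    return 1.0 - adjusted / raw

# Synthetic data (an assumption, not real patients).
rng = np.random.default_rng(0)
n = 5000
group = rng.integers(0, 2, size=n)
seen = rng.normal(size=n)                   # damage a fixed rubric captures
unseen = rng.normal(size=n) + 1.0 * group   # damage the rubric misses
pain = seen + unseen + rng.normal(scale=0.5, size=n)

rubric_score = seen                  # hypothetical KLG-style grade
model_score = seen + 0.5 * unseen    # hypothetical model that sees more

share_rubric = share_of_gap_explained(pain, group, rubric_score)
share_model = share_of_gap_explained(pain, group, model_score)
print(f"share of gap explained by rubric: {share_rubric:.2f}")
print(f"share of gap explained by model:  {share_model:.2f}")
```

In this toy setup the rubric explains almost none of the gap, because the damage it measures is spread evenly across both groups, while the score that picks up some of the missed damage explains a good part of it. That is the pattern the study reports, though the real analysis involved far more careful controls.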
The goal isn’t necessarily to start using this algorithm in a clinical setting. But by outperforming the KLG methodology, it revealed that the standard way of measuring pain is flawed, at a much greater cost to Black people. This should tip off the medical community to investigate which radiographic markers the algorithm might be seeing, and update their scoring methodology.
“It actually highlights a really exciting part of where these kinds of algorithms can fit into the process of medical discovery,” says Obermeyer. “It tells us if there’s something here that’s worth looking at that we don’t understand. It sets the stage for humans to then step in and, using these algorithms as tools, try to figure out what’s going on.”
“The cool thing about this paper is it is thinking about things from a completely different perspective,” says Irene Chen, a researcher at MIT who studies how to reduce health-care inequities in machine learning and was not involved in the paper. Instead of training the algorithm on well-established expert knowledge, she says, the researchers chose to treat patients’ self-assessment as truth. In doing so, the study uncovered important gaps in what the medical field usually considers to be the more “objective” pain measure.
“That was exactly the secret,” agrees Obermeyer. If algorithms are only ever trained to match expert performance, he says, they will simply perpetuate existing gaps and inequities. “This study is a glimpse of a more general pipeline that we are increasingly able to use in medicine for generating new knowledge.”