DNA databases are too white. This man aims to fix that.

Carlos D. Bustamante’s hunt for genetic variations between populations should help us better understand and treat disease.

David Rotman

October 15, 2018

In the 15 years since the Human Genome Project first exposed our DNA blueprint, vast amounts of genetic data have been collected from millions of people in many different parts of the world. Carlos D. Bustamante’s job is to search that genetic data for clues to everything from ancient history and human migration patterns to the reasons people with different ancestries are so varied in their response to common diseases.

Bustamante’s career has roughly spanned the period since the Human Genome Project was completed. A professor of genetics and biomedical data science at Stanford and 2010 winner of a MacArthur genius award, he has helped to tease out the complex genetic variation across different populations. These variants mean that the causes of diseases can vary greatly between groups. Part of the motivation for Bustamante, who was born in Venezuela and moved to the US when he was seven, is to use those insights to lessen the medical disparities that still plague us.

But while it’s an area ripe with potential for improving medicine, it’s also fraught with controversies over how to interpret genetic differences between human populations. In an era still obsessed with race and ethnicity—and marred by the frequent misuse of science in defining the characteristics of different groups—Bustamante remains undaunted in searching for the nuanced genetic differences that these groups display.

Perhaps his optimism is due to his personality—few sentences go by without a “fantastic” or “extraordinarily exciting.” But it is also his recognition as a population geneticist of the incredible opportunity that understanding differences in human genomes presents for improving health and fighting disease.

David Rotman, MIT Technology Review’s editor at large, discussed with Bustamante why it’s so important to include more people in genetic studies and understand the genetics of different populations.

How good are we at making sure that the genomic data we’re collecting is inclusive?

I’m optimistic, but it’s not there yet.

In our 2011 paper, the statistic we had was that more than 96% of participants in genome-wide association studies were of European descent. In the follow-up in 2016, the number went from 96% to around 80%. So that’s getting better. Unfortunately, or perhaps fortunately, a lot of that is due to the entry of China into genetics. A lot of that was due to large-scale studies in Chinese and East Asian populations. Hispanics, for example, make up less than 1% of genome-wide association studies. So we need to do better. Ultimately, we want precision medicine to benefit everybody.

Aside from a fairness issue, why is diversity in genomic data important? What do we miss without it?

First of all, it has nothing to do with political correctness. It has everything to do with human biology and the fact that human populations and the great diaspora of human migrations have left their mark on the human genome. The genetic underpinnings of health and disease have shared components across human populations and things that are unique to different populations.

How does that play out?

Diabetes is a great example. If we look at the genetics of diabetes, they are different in different parts of the world. In the early 2010s, the Broad [Institute of MIT and Harvard] did a study with the National Institute of Genomic Medicine in Mexico to study the genetics of diabetes. Sure enough, they found a genetic variant that has a 25% frequency in Mexico that you don’t see in European, East Asian, or African populations. It is largely seen only in the Americas, and it underscores a large part of ethnic disparity in diabetes.

We’ve done research on seemingly innocuous traits like blond hair. There is no more striking phenotype. Some people have blond hair and some people don’t. And the cause of blond hair in Melanesia is completely different from the cause in Europe—and that’s blond hair. So why do you think diabetes, heart disease, all these other complex traits will have identical causes in all humans? It doesn’t make sense.

It turns out the highest prevalence of asthma [in the US] is in individuals of Puerto Rican ancestry, followed by individuals of African-American ancestry, followed by European ancestry. The people with the lowest rate of asthma are those of Mexican ancestry. You have two of the Hispanic populations at the opposite ends of the spectrum.

Why is detailing these genetic differences helpful for medicine?

If the genetic etiology of disease is different, it gives us an opportunity to discover new drug targets. It gives us new biology that then can be used even for those that don’t necessarily suffer from the disease in that way. It’s important for drug discovery. If you think of it like looking for oil, we’ve only been looking for oil in the North Sea. There are plenty of other places to search, and that benefits everyone.

Secondly, we’re finding that polygenic risk scores [disease-risk predictions based on genetic tests] for European ancestry don’t translate easily into other populations. If we don’t have broad representation in medical and population genetics, then we run the risk of widening health disparities, which will be a terrible outcome for precision medicine and precision health.

So aren’t you disappointed by the lack of progress in including more populations in genomic data?

I’m actually super-excited. We’ve done a great job of mining for drug targets in Europe. Iceland led the way, Britain led the way, and now Finland. So we’re tapping all those resources—awesome. But what about Latin America? What about Africa? What about South Asia? All of those places have tons to contribute to our understanding of health and disease.

It is both a moral obligation and a missed scientific opportunity if we don’t go to work in those populations.

Many genetic researchers have long argued that race has no basis in science. But the debate doesn’t seem to go away.

In a global context there is no model of three, or five, or even 10 human races. There is a broad continuum of genetic variation that is structured, and there are pockets of isolated populations. Three, five, or 10 human races is just not an accurate model; it is far more of a continuum model.

Humans are a beautifully diverse species both phenotypically and genetically. This is very classic population genetics. If I walk from Cape Horn all the way to the top of Finland, every village looks like the village next to it, but at the extremes people are different.

But as a population geneticist?

I don’t find race a meaningful way to characterize people.

You walk a tricky line, though, don’t you? You’re pointing out the importance of variance between different populations, but you don’t want to reinforce old categories of race.

We can’t use genetics for the purpose of trying to define the stories we tell about ourselves. Social determinants of health are often far more important than genetic determinants of health, but that doesn’t mean genetic determinants aren’t important. So you’ve got to embrace the complexity and figure out how we translate this to a broad general public.

I’m actually an optimist. I think the world is becoming a less racist place. If you talk to the next generation of people, millennials on down, those abhorrent ideologies are thrown away. That means it gives us a space to now think about what role does genetics play in health and diseases and human evolution in ways that we can soberly understand and bring to bear on important problems.

We can’t allow genetics to get hijacked by identity politics. If you begin to allow politics and other interests to come in, you just muddy the waters. You need to let the data lead. You need to let outcomes lead. And the rest will follow.

Data bias in DNA studies

Precision medicine is getting more precise for some but leaving many others behind. And those left behind are often people with Latin American, African, Native American, and other ancestries that are underrepresented in genomic databases.

By far, most of the data in genome-wide association studies, which have been critical in spotting genetic variants tied to common diseases, comes from people with European ancestry. In 2011, Carlos D. Bustamante and his colleagues called out the disparities and the resulting threat that genomic medicine “will largely benefit a privileged few.” In subsequent years, the collection of genomic data has exploded, but the disparities remain. In 2016, Alice Popejoy, who was a PhD student at the University of Washington and is now a postdoc in Bustamante’s lab, updated the results in the journal Nature, finding little progress for most population groups.

One result of this lack of data is that genetic tests may be less relevant and accurate for people from underrepresented groups. Increasingly popular consumer genetic tests can be misleading or just plain wrong, and medical genetic tests for some common diseases are often inconclusive. Likewise, Popejoy says, false positives and false negatives in genetic diagnoses are more common in people with non-European ancestry, because the results are interpreted using databases that are incomplete or biased toward European ancestry.
