Rebooting the Human Genome

The official map of the human genome can’t tell you everything about your genes. Does graph theory offer a better way?

Antonio Regalado

June 3, 2015

The Human Genome Project was one of mankind’s greatest triumphs. But the official gene map that resulted in 2003, known as the “reference genome,” is no longer up to the job.

Image courtesy of University of California, Santa Cruz

So say scientists who are laying plans for a new universal map that would combine the genomes of hundreds, and eventually thousands, of people to create a true reference that reflects all of humanity.

The problem with the existing gene map is that it represents only one way a person’s genome could look. The new map, called a “graph genome,” or pan-genome, would use mathematics to reflect every possible twist or turn a person’s genome could take across its 46 chromosomes.

“It’s very new technology. But in less than five years, everyone is going to be using it,” predicts Gabor Marth, a geneticist at the University of Utah.

The math behind the idea is graph theory. You are already familiar with graphs if you know about the “Six Degrees of Kevin Bacon” game. Every actor is a node, and an edge connects any two actors who appeared in a movie together. The game is to find the smallest number of edges it takes to reach Kevin Bacon.
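For readers who want to see the game as a computation, here is a minimal sketch in Python. It assumes a made-up set of cast lists (the film titles and actor names other than Kevin Bacon are placeholders, not real credits) and uses breadth-first search, a standard way to count the fewest edges between two nodes. The function names are illustrative only.

```python
from collections import deque

# Hypothetical toy data: placeholder films and actors, not real credits.
FILMS = {
    "Film A": ["Kevin Bacon", "Actor 1"],
    "Film B": ["Actor 1", "Actor 2"],
    "Film C": ["Actor 2", "Actor 3"],
    "Film D": ["Actor 4"],
}


def build_costar_graph(films):
    """Each actor is a node; an edge joins two actors who share a film."""
    graph = {}
    for cast in films.values():
        for actor in cast:
            graph.setdefault(actor, set())
        for a in cast:
            for b in cast:
                if a != b:
                    graph[a].add(b)
    return graph


def degrees_of_separation(graph, start, target):
    """Return the fewest edges from start to target, or None if unreachable."""
    if start not in graph or target not in graph:
        return None
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, dist = queue.popleft()
        for costar in graph[actor]:
            if costar == target:
                return dist + 1
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, dist + 1))
    return None


graph = build_costar_graph(FILMS)
print(degrees_of_separation(graph, "Actor 3", "Kevin Bacon"))
```

Breadth-first search explores the graph one "degree" at a time, so the first time it reaches the target it has found a shortest route.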

In a graph genome, the objective will be to find a path through the genetic letters that exactly matches yours. If every possible path is represented—which is the whole idea—it will make the interpretations of genomes faster, less expensive, and more accurate.
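To make the idea of a matching path concrete, here is a toy sketch in Python. It is not how any of the research groups mentioned here build their graphs. The node numbers, DNA snippets, and edges are invented for illustration, and the search is a simple depth-first walk that only accepts exact matches. Each node holds a short stretch of letters, and a fork in the graph stands for a place where people differ.

```python
# Hypothetical toy graph: every node id, sequence, and edge is made up.
NODES = {1: "ACG", 2: "T", 3: "G", 4: "TTA", 5: "CC", 6: "GA"}
# Nodes 2 and 3 are alternative single letters; node 5 is an optional insertion.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [6], 6: []}


def find_path(nodes, edges, sample, start, end):
    """Return the node ids whose letters spell `sample` exactly, or None."""

    def walk(node, offset):
        label = nodes[node]
        # The sample must continue with this node's letters at this offset.
        if not sample.startswith(label, offset):
            return None
        offset += len(label)
        if node == end:
            # A valid path must use up the whole sample.
            return [node] if offset == len(sample) else None
        for nxt in edges.get(node, []):
            rest = walk(nxt, offset)
            if rest is not None:
                return [node] + rest
        return None

    return walk(start, 0)


print(find_path(NODES, EDGES, "ACGTTTAGA", 1, 6))
print(find_path(NODES, EDGES, "ACGGTTACCGA", 1, 6))
```

A sequence that takes a turn the graph does not contain, such as "ACGATTAGA" in this toy, gets no path at all, which is why the approach depends on representing as many variations as possible.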

Fewer than 250,000 people’s genomes have ever been sequenced, but that figure is set to double each year as genome sequencing becomes a routine way to diagnose disease at children’s hospitals and cancer centers. Some expect that eventually every newborn will have its genome decoded.

Accurately determining how each of those differs from the others is where the new gene map would come in. Marth’s Utah lab is one of several academic teams working on prototypes of genome graphs that will be submitted to a standards body in June.

There’s commercial interest, too. The genetics company 23andMe is developing graphs and companies like Google are watching closely, says Benedict Paten, a researcher at the University of California, Santa Cruz. “Everyone is interested in having humanity represented in one fundamental data structure,” he says.

The problem with the current reference isn’t only that it still has gaps, or that it includes DNA bolted together from a dozen or so anonymous people, Frankenstein fashion. It’s that, as a single account of the three billion letters that make up human genes, it can’t readily be used to say how yours might differ.

The trouble arises when a new genome is sequenced, using speedy machines that shred DNA into millions of small bits. To reassemble these, the bits are lined up against the reference, which, like the picture on the front of a puzzle box, acts as a guide. But typically about 5 percent of a person’s DNA data won’t fit anywhere.
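The puzzle-box problem can be sketched in a few lines of Python. This is a deliberate simplification: the reference string and the reads are invented, and the code demands an exact match, whereas real alignment software tolerates small mismatches and uses specialized indexes. The point is only to show how a fragment that differs from the reference ends up with nowhere to go.

```python
# Hypothetical toy reference and reads, invented for illustration.
REFERENCE = "ACGTTTAGACCTGA"
READS = ["ACGTT", "TAGAC", "GAGGGCC"]


def place_reads(reference, reads):
    """Place each read at its first exact match; collect reads that fit nowhere."""
    placed = {}
    unplaced = []
    for read in reads:
        position = reference.find(read)  # -1 means no exact match
        if position == -1:
            unplaced.append(read)
        else:
            placed[read] = position
    return placed, unplaced


placed, unplaced = place_reads(REFERENCE, READS)
print("placed:", placed)
print("unplaced:", unplaced)
```

Here the third read carries letters the reference lacks, so it is left over, much like the puzzle pieces that match nothing on the box.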

“The more different you are from the reference, the harder it is to fit in all the pieces,” says David Mittelman, chief scientific officer of Tute Genomics, a bioinformatics company. “And if you can’t find the differences, you can’t find the risk.”

The international Genome Reference Consortium that maintains the reference has tried to keep up, plastering it with scientific sticky notes. These “alternative” sequences, many connected to the human immune system, now include more than 150 genes and more than 3.6 million genetic letters. But these footnotes are inconvenient, and for the most part, simply ignored.

In many cases, it doesn’t make a difference. Any two people share the vast majority of their DNA letters, over 99 percent. And the reference genome is able to highlight the most common differences, which are changes to single DNA letters inside of genes.

But it’s bad at flagging certain larger chunks of DNA that can go missing, get added, or get rearranged. And these are important: some have been linked to autism (see “Solving the Autism Puzzle”), while others appear to be key parts of what separates us from the apes.

“It’s quite a bit of a shortcoming if the essence of what makes us human is in these highly variable regions,” says Alex Lash, chief informatics officer of the Simons Foundation, in New York, which is supporting work on a graph genome at Santa Cruz.

Lash says the idea of building a graph genome is “high risk,” since billions of dollars have been invested in software and scientific efforts supporting the current reference, and people may not want to switch over.

Just how big an advance a graph genome will be also remains to be proved, says Deniz Kural, CEO of Seven Bridges Genomics. His company has developed a graph representation of the genome by using public DNA information on 70,000 people, as well as tools to analyze it. It’s now testing how much more accurate genomes assembled using the graph really are.

Kural says the amount of data needed to describe, mathematically, every possible variation of the human genome could fit on a flash drive; it’s only around one gigabyte. Describing the meaning of those variations is a far more extensive undertaking, encompassing tens of thousands of scientific articles.

Scientists involved in developing a graph approach say they expect their idea will win out, even if it means putting the map created by the original Human Genome Project into deep storage. “It’s kind of a ridiculous thing to do, to shoehorn everyone through the lens of this one reference genome,” says Michael Schatz, a bioinformatics researcher at the Cold Spring Harbor Laboratory. “It was a tremendous milestone to have that first representation of the genome, but now we are outgrowing it. It’s time to reboot.”
