Software Mines Science Papers to Make New Discoveries

Software digests thousands of research papers to accurately identify proteins that could be productive targets for cancer drugs.

Tom Simonitearchive page

November 25, 2013

Software that read tens of thousands of research papers and then predicted new discoveries about the workings of a protein that’s key to cancer could herald a faster approach to developing new drugs.

The software, developed in a collaboration between IBM and Baylor College of Medicine, was set loose on more than 60,000 research papers that focused on p53, a protein involved in cell growth, which is implicated in most cancers. By parsing sentences in the documents, the software could build an understanding of what is known about enzymes called kinases that act on p53 and regulate its behavior; these enzymes are common targets for cancer treatments. It then generated a list of other proteins mentioned in the literature that were probably undiscovered kinases, based on what it knew about those already identified. Most of its predictions tested so far have turned out to be correct.

“We have tested 10,” Olivier Lichtarge of Baylor said Tuesday. “Seven seem to be true kinases.” He presented preliminary results of his collaboration with IBM at a meeting on the topic of Cognitive Computing held at IBM’s Almaden research lab.

Lichtarge also described an earlier test of the software in which it was given access to research literature published prior to 2003 to see if it could predict p53 kinases that have been discovered since. The software found seven of the nine kinases discovered after 2003.

“P53 biology is central to all kinds of disease,” says Lichtarge, and so it seemed to be the perfect way to show that software-generated discoveries might speed up research that leads to new treatments. He believes the results so far show that to be true, although the kinase-hunting experiments are yet to be reviewed and published in a scientific journal, and more lab tests are still planned to confirm the findings so far. “Kinases are typically discovered at a rate of one per year,” says Lichtarge. “The rate of discovery can be vastly accelerated.”

Lichtarge said that although the software was configured to look only for kinases, it also seems capable of identifying previously unidentified phosphatases, which are enzymes that reverse the action of kinases. It can also identify other types of protein that may interact with p53.

The Baylor collaboration is intended to test a way of extending a set of tools that IBM researchers already offer to pharmaceutical companies. Under the banner of accelerated discovery, text-analyzing tools are used to mine publications, patents, and molecular databases. For example, a company in search of a new malaria drug might use IBM’s tools to find molecules with characteristics that are similar to existing treatments. Because software can search more widely, it might turn up molecules in overlooked publications or patents that no human would otherwise find.

“We started working with Baylor to adapt those capabilities, and extend it to show this process can be leveraged to discover new things about p53 biology,” says Ying Chen, a researcher at IBM Research Almaden.

It typically takes between $500 million and $1 billion dollars to develop a new drug, and 90 percent of candidates that begin the journey don’t make it to market, says Chen. The cost of failed drugs is cited as one reason that some drugs command such high prices (see “A Tale of Two Drugs”).

Lawrence Hunter, director of the Center for Computational Pharmacology at the University of Colorado Denver, says that careful empirical confirmation is needed for claims that the software has made new discoveries. But he says that progress in this area is important, and that such tools are desperately needed.

The volume of research literature both old and new is now so large that even specialists can’t hope to read everything that might help them, says Hunter. Last year over one million new articles were added to the U.S. National Library of Medicine’s Medline database of biomedical research papers, which now contains 23 million items. Software can crunch through massive amounts of information and find vital clues in unexpected places. “Crucial bits of information are sometimes isolated facts that are only a minor point in an article but would be really important if you can find it,” he says.

Lichtarge believes that software like his could change the way scientists conduct and assess new research findings. Scientists currently rely in part on the reputation of the people, institutions, and journals involved, and the number of times a paper is cited by others.

Software that gleans meaning from all the information published within a field could offer a better way, says Lichtarge. “You might publish directly into the [software] and see how disruptive it is,” he says.

Hunter thinks that scientists might even use such tools at an earlier stage, having software come up with evidence for and against new hypotheses. “I think it would really help science go faster. We often waste a lot of time in the lab because we didn’t know every little thing in the literature,” he says.

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.