How to Build a Better DNA Search Engine

The techniques for indexing Chinese language websites could dramatically improve the speed of bioinformatic searches, according to research by SOSO, the third-largest Chinese search engine.

Emerging Technology from the arXivarchive page

June 30, 2010

If there’s one thing that Google has taught the current generation of web savvy surfers, it is that internet searches are quick. The small print at the top of every search it delivers stamps this idea into the culture of search.

Type the word “physics” into the search engine, for example, and it delivers 102,000,000 results in 0.21 seconds. That’s mind-blowingingly fast.

That might sound like good news for researchers combing bioinformatic databases. These databases are huge and growing exponentially. They contain, for example, the rapidly increasing number of genomes from different species around the planet as well as the genomes of different individuals within the same species.

Given our experience with web search, it’s easy to imagine that finding a gene that is common to more than one organism or individual ought to be as quick as searching Google. But it isn’t.

The reason according to Wang Liang, a computer scientist at SOSO.com, one of the big three search engines in China, is that bioinformatics has failed to exploit the basic search techniques that have made search engines like Google so quick.

Most bioinformatic searches use either the BLAST or FASTA algorithms. These essentially compare the data from one genome with the data from another, then with another and so on. That’s satisfactory when there are a relatively small number of genomes but it quickly becomes unmanageable as the number of genomes increases exponentially.

Search engines faced exactly this problem 20 years ago with the growth of the world wide web. Search engines initially indexed the web by recording the words that each document contained. Searching for a specific word then meant looking for it in one web page, then in another and another and so on. This approach becomes increasingly slow as the number of documents grows..

So the engines took another approach: they turned the indexing process on its head creating what is known as an inverted index. “The idea of an inverted index is very simple,” says Liang.

Instead of creating a list of web pages and the words on each page, the indexing process records for each word, a list of webpages where it appears.

So a search now looks only through the list of words that a search engine has indexed. When it finds the word, that entry also records the webpages where it appears. In other words, instead of searching an index of webpages to find a specific word, you search though an index of words to find the webpages where it appears.

That dramatically simplifies things but there are various complexities that making the indexing process tricky. For example, in English, the spaces between words show clearly where each word starts and finishes. That isn’t the case in genetic data. So one important questions is what constitutes a word.

Liang says that an important clue comes from the way search engines index languages like Chinese where there are no spaces between words either. One way to index a Chinese document is to segment the text into n-grams, words that are n-letters long. So you start by segmenting it into 1-grams, one letter words, then 2-grams, two letter words. A search for a 3 letter word, such as ABC, can then be done by searching for the 2-grams AB and BC.

In fact, some Chinese search engines work in exactly this way, by indexing all the 2-grams.

But how many letters are there in a genetic word, what n-grams should a search engine index? A 1-gram segmentation gives only four words, the base nucleotides A, T, C and G. But that’s no good because the combined searches needed for longer words are then unmanageable.

The answer comes from the statistical distribution of words in DNA sequences which Liang says follows Zipf’s law. This essentially states that in any long document, 50 per cent of the words appear only once. This can be used to find a kind of average length length of DNA words.

In Chinese for example, the percentage of 1-gram words that appear only once is less than 50 per cent, the percentage of 2-gram words that appear only once is about 50 percent and the percentage of 3-gram words is less than 50 per cent. So 2-gram words are a good average.

Liang applies the same criteria to find the average length of words in the genomes of arabidopsis, aspergillus, the fruit fly and the mouse. And he finds that a good average word length is about 12 letters. So the best way to index genome data is to look for 12-grams, he says.

None of this needs any new technology to complete. Liang says that the open source search engine Lucene is the perfect forum in which to do the work and, impressively, has even used it to build a rudimentary bioinformatics search engine himself.

It makes sense that the huge improvements in search that have been made by commercial search engines ought to find application in the bioinformatics world. Perhaps there’s even a decent business model in such a plan, for example by serving ads targeted at the kind of people who do bioinformatics search.

The only question is who will lead the way in this area. And if this work is anything to go by, it looks as if the Chinese search engine SOSO has the lead.

Ref: arxiv.org/abs/1006.4114: How To Build A DNA Search Engine Like Google?

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.