The Rare Disease Search Engine That Outperforms Google

A powerful new search engine designed to help diagnose rare diseases could prove a boon for both medics and the public

 

In the late 1940s, a professor at the University of Maryland School of Medicine coined an unusual phrase to describe unexpected diagnoses. “When you hear hoofbeats behind you, don’t expect to see a zebra,” he said. The phrase stuck, and today medics commonly use the term “zebra” to describe a rare disease, usually defined as one that affects fewer than 1 in 2,000 people.

Rare diseases are inherently hard to diagnose. According to the European Organisation for Rare Diseases, 25 per cent of diagnoses are delayed by between 5 and 30 years.

So it’s no surprise that medics are looking for more effective ways to do the job. An increasingly common aid in this process is the search engine, typically Google. This forms part of an iterative process in which a medic enters symptoms into a search engine, examines the list of potential diseases it returns and then looks for further evidence of symptoms in the patient.

The problem, of course, is that common-or-garden search engines are not optimised for this process. Google, for example, considers pages important if they are linked to by other important pages, the basis of its famous PageRank algorithm. However, rare diseases by definition are unlikely to have a high profile on the web. What’s more, searches are likely to be plagued with returns from all sorts of irrelevant sources.
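To see why link-based ranking works against obscure pages, here is a minimal sketch of the textbook PageRank iteration on a toy web. It is an illustration only, not Google's production ranking: the page names, the link graph, the damping factor of 0.85 and the iteration count are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nothing links to the rare disease page.
toy_web = {
    "health_portal": ["common_cold", "flu"],
    "news_site": ["common_cold", "flu", "health_portal"],
    "common_cold": ["flu", "health_portal"],
    "flu": ["common_cold", "health_portal"],
    "rare_disease_page": ["health_portal"],
}

scores = pagerank(toy_web)
for page, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {value:.3f}")
```

In this toy graph the rare disease page receives no inbound links, so it never rises above the baseline share, however relevant its content might be to a particular patient.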

Today, Radu Dragusin at the Technical University of Denmark and a few pals unveil an alternative. These guys have set up a bespoke search engine dedicated to the diagnosis of rare diseases called FindZebra, a name based on the common medical slang for a rare disease. After comparing the results from this engine against the same searches on Google, they show that it is significantly better at returning relevant results.

The magic sauce in FindZebra is the index it uses to hunt for results. These guys have created this index by crawling a specially selected set of curated databases on rare diseases. These include the Online Mendelian Inheritance in Man database, the Genetic and Rare Diseases Information Center and Orphanet.

They then use the open source information retrieval tool Indri to search this index via a website with a conventional search engine interface. The result is FindZebra.
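The general idea of ranking a small curated collection by its text alone, rather than by links, can be sketched in a few lines. The example below is a toy query-likelihood scorer with Dirichlet smoothing, a standard technique from the language-modelling family of retrieval methods that Indri belongs to. It does not use Indri's API and is not the authors' configuration: the three documents are invented placeholder text, and the function names and the smoothing value `mu=10.0` are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Count terms per document and across the whole collection."""
    index = {title: Counter(tokenize(body)) for title, body in docs.items()}
    collection = Counter()
    for counts in index.values():
        collection.update(counts)
    return index, collection


def score(query, doc_counts, collection, mu=10.0):
    """Log query likelihood of a document with Dirichlet smoothing."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(collection.values())
    total = 0.0
    for term in tokenize(query):
        coll_prob = collection[term] / coll_len
        if coll_prob == 0:
            continue  # term never appears in the collection, so it cannot discriminate
        prob = (doc_counts[term] + mu * coll_prob) / (doc_len + mu)
        total += math.log(prob)
    return total


def search(query, index, collection, mu=10.0):
    """Return document titles ordered from best to worst match."""
    return sorted(
        index,
        key=lambda title: score(query, index[title], collection, mu),
        reverse=True,
    )


# Invented toy collection standing in for a curated rare disease corpus.
docs = {
    "Fibrodysplasia ossificans progressiva": (
        "malformed big toes at birth progressive bone formation "
        "in muscle and connective tissue"
    ),
    "Osteosarcoma": (
        "malignant bone tumor causing pain and swelling often near the knee"
    ),
    "Common cold": "runny nose sore throat cough and sneezing",
}

index, collection = build_index(docs)
results = search("deformity of both big toes and bone growth near spine", index, collection)
print(results)
```

Because every document in such an index comes from a vetted source, the ranking only has to decide which disease description best matches the symptoms; it never has to compete with popular but irrelevant pages.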

Finally, they compared the results of searches on FindZebra against the same searches on Google, including Google restricted to the same limited set of sites, a feature that is possible with advanced Google searches. Dragusin and co say that the Google results are significantly worse than their own.

For example, on FindZebra the search query “Boy, normal birth, deformity of both big toes (missing joint), quick development of bone tumor near spine and osteogenesis at biopsy” returns the correct diagnosis “Fibrodysplasia ossificans progressiva” as the first result. However, this diagnosis does not appear at all in the results from any type of Google search.

This indicates that the PageRank algorithm, or at least the way Google has tweaked it, is not suited to this kind of search. “Our finding, that FindZebra outperforms Google overall for this task and especially when restricted to the sites of our collection (Google Restricted), suggests that Google ranking algorithm is suboptimal for the task at hand,” they conclude.
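A comparison of this kind comes down to where the correct diagnosis lands in each engine's ranked list. One common way to summarise that is the mean reciprocal rank, sketched below. This is not necessarily the measure used in the paper, and the engine names, result lists and diagnoses are hypothetical placeholders rather than data from the study.

```python
def reciprocal_rank(results, correct):
    """Return 1/position of the correct diagnosis, or 0.0 if it is absent."""
    for position, title in enumerate(results, start=1):
        if title == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (results, correct diagnosis) pairs."""
    return sum(reciprocal_rank(results, correct) for results, correct in runs) / len(runs)


# Hypothetical result lists for two queries on two engines (not real data).
specialised_engine = [
    (["disease_a", "disease_b", "disease_c"], "disease_a"),  # correct answer ranked first
    (["disease_d", "disease_e", "disease_f"], "disease_e"),  # correct answer ranked second
]
general_engine = [
    (["page_x", "page_y", "page_z"], "disease_a"),            # correct answer missing
    (["page_x", "page_y", "disease_e"], "disease_e"),         # correct answer ranked third
]

specialised_mrr = mean_reciprocal_rank(specialised_engine)
general_mrr = mean_reciprocal_rank(general_engine)
print(specialised_mrr, general_mrr)
```

A diagnosis that never appears in the list contributes zero, so an engine that misses the correct answer entirely is penalised as heavily as possible for that query.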

Although it is still a research project, Dragusin and co have made their rare disease search engine publicly available at www.findzebra.com. This could clearly become a valuable tool for the medical community.

What is less clear, however, is how this tool will be used by the general public. The site comes with the forlorn message: “Warning! FindZebra is a research project and it is to be used only by medical professionals”.

FindZebra could obviously be a hypochondriac’s charter. On the other hand, that’s true of any medical dictionary.

The informed public are increasingly visiting their doctors armed with detailed information downloaded from the internet. Any move to improve the quality of this information must surely be of significant value.

Ref: arxiv.org/abs/1303.3229: FindZebra: A Search Engine For Rare Diseases
