Skip to Content

The Unexpected Ubiquity of Spam Detection Algorithms

‘Fuzzy hashing’ was invented to flag spam emails, but has found application in everything from malware detection to genome sequence alignment

If you want to find out if two spam emails are identical without comparing every single byte in them – which would take an undue amount of time – you digest the file into a short string, known as a cryptographic hash value. It’s in the nature of a cryptographic hash function that even the tiniest change in a file will yield a wildly different hash value, so it’s easy to tell if two files are identical.

From Wikipedia: “A cryptographic hash function at work. Note that even small changes in the source input drastically change the resulting output.”

This may sound trivial, but it’s not: if you already have an example of an undesirable file – say a piece of spam email – and you want to compare it pairwise to millions of other emails streaming through your servers, you need to be able to identify it quickly. Comparing 40 digit strings is significantly easier than reading millions of penis pill emails all the way through.

The problem with cryptographic hash functions is that even the tiniest change in the source file yields a hash value that’s completely different. In other words they’re useless for comparing files that may be functionally identical but only a little bit different.

One solution to this problem is what’s known as a fuzzy hash, or a rolling hash: instead of just hashing the entire file, the algorithm “provide[s] a continuous stream of hash values for a rolling window over the binary [of the original file],” says David French, a senior researcher at Carnegie Mellon’s Software Engineering Institute.

By comparing a bunch of hashes from two files, you can determine the percentage difference between them.

An Approach That Generalizes to… Everything?

This is where fuzzy hashing becomes really useful: suddenly, you’ve got a general purpose method for rapidly comparing any stream of information.

So you see fuzzy hashes cropping up all over the place. Here’s an effort by the aforementioned Dr. French to use fuzzy hashes to rapidly determine whether a file on a user’s computer is a virus or other piece of malware. And here’s an attempt to use fuzzy hashes to compare the draft gene sequence of one microorganism to a known sequence from a related creature. This is huge for geneticists, for whom the fastest way to assemble a reasonably accurate genome from a previously un-sequenced creature is to compare the sequenced fragments of its genes to a known sequence.

And have you ever wondered how Google detects duplicate content so that it can deliver to you only the most relevant results, instead of the hundreds of copies of a news article or reference work that inevitably end up on splogs spread across the Internet? Fuzzy hashing, again! It can even detect identical documents that are stored in completely different formats:

*A frequent case is that of the same document but in different formats; in these cases we will have completely different documents at binary level. The obvious solution is to compare plain text conversions of all these formats, but these conversions are never identical, because of the different treatments of the converters on various formatting elements (treatment of textual characters, diacritics, spacing, paragraphs …). In this work we introduce the possibility of using what is known as fuzzy-hashing. The idea is to produce fingerprints of files (or documents, etc..). This way, a comparison between two fingerprints could give us an estimate of the closeness or distance between two files, documents, etc.

Hell, you can tell just how important fuzzy hashing is because Michael Gregory Hoglund of Monte Sereno, CA just received an all-encompassing patent on the process, which he calls “generating a digital DNA sequence” for a data object. Before you know it, a patent troll will have bought this one off of Hoglund, and then the real fun can begin.

Fuzzy hashes are one of those invisible pieces of plumbing, like the MapReduce algorithm invented by Google that underpins practically every giant web service you can think of, that is critical to the operation of the global hive-mind on which we depend.

Anytime a computer must ask, “Which of these things is not like the other?” The answer is: Ask a fuzzy hash function!

Follow Mims on Twitter or contact him via email.

Keep Reading

Most Popular

wet market selling fish
wet market selling fish

This scientist now believes covid started in Wuhan’s wet market. Here’s why.

How a veteran virologist found fresh evidence to back up the theory that covid jumped from animals to humans in a notorious Chinese market—rather than emerged from a lab leak.

light and shadow on floor
light and shadow on floor

How Facebook and Google fund global misinformation

The tech giants are paying millions of dollars to the operators of clickbait pages, bankrolling the deterioration of information ecosystems around the world.

masked travellers at Heathrow airport
masked travellers at Heathrow airport

We still don’t know enough about the omicron variant to panic

The variant has caused alarm and immediate border shutdowns—but we still don't know how it will respond to vaccines.

This new startup has built a record-breaking 256-qubit quantum computer

QuEra Computing, launched by physicists at Harvard and MIT, is trying a different quantum approach to tackle impossibly hard computational tasks.

Stay connected

Illustration by Rose WongIllustration by Rose Wong

Get the latest updates from
MIT Technology Review

Discover special offers, top stories, upcoming events, and more.

Thank you for submitting your email!

Explore more newsletters

It looks like something went wrong.

We’re having trouble saving your preferences. Try refreshing this page and updating them one more time. If you continue to get this message, reach out to us at customer-service@technologyreview.com with a list of newsletters you’d like to receive.