# Fingerprinting Your Files

Page 3 of 4

**How Hash Functions Work**

So that’s why hash functions are helpful. Now, let’s see what they actually look like.

Among the most widely used hash functions today are the so-called MD5 (for Message Digest #5). MD5 produces a hash that is 128 bits long and that is commonly written as sequence of 32 hexadecimal (base 16) digits. If you were to take my name and process it with MD5, you would get this seemingly random string:

c55bbe0f3ba258f5b1cb6d5b62b0b360

Or, to state it with more mathematical formality:

MD5(Simson Garfinkel)= c55bbe0f3ba258f5b1cb6d5b62b0b360

Each of those hexadecimal characters represents 4 bits; the MD5 value of my name is actually:

1100010101011011101111100000111100111011101

0001001011000111101011011000111001011011011

010101101101100010101100001011001101100000

Most people work with the hexadecimal representation because its pretty easy to look two hashes and tell if they are the same or different.

MD5 works by splitting the file up into lots of small pieces, and then taking each of those chunks and performing hundreds of mathematical operations that shuffle, invert, transpose, and otherwise process the bits into an unrecognizable mess. The word unrecognizable in this description is key. The fundamental requirement of a good hash function is that it should be impossible to predict the fingerprint of a file without actually going to the effort of computing that fingerprint there must be no short-cuts. If there were, you might be able to run the hash function backwards and create a file that had a specific hashfor example, the hash of another file. Indeed, the entire security of hash functions falls apart utterly if it is possible to generate two files that have the same hash.

The beauty of the hash function is that even a tiny modification to the input produces a dramatic change in the output. Mathematically, the functions are designed so that every bit in the output will have a 50 percent chance of changing for every single bit changed in the input.

Lets look at another MD5 hash, this one of a slightly different representation of my name:

MD5(Simson L. Garfinkel)= df876e8e6f548d5be698fab7f06dd278

Merely adding “L.” produces a completely different hash. If you compare the two hashes bit-for-bit youll find that 63 out of the 128 positions have changed from a 0-to-1 or a 1-to-0, and the other 65 have remained unchanged.

Unfortunately, the whole theory of cryptographic hash functions has a huge problem. The use of these functions requires that there be no so-called “collisions”. Either accidentally or on purpose, there should be no two files that have the same cryptographic fingerprint. And as it turns out, this is an impossible requirement.

The reason is pretty simple. File fingerprints are a fixed size, which means that there is a finite number of possible fingerprints. Files, on the other hand, can be any size. Thus, there are more possible files than fingerprints, and so there must be at least one fingerprint that is the fingerprint of multiple files. The mathematical term for this is the pigeonhole principle. Indeed, even if you restrict yourself to files that are just nine characters long, there are still 256 times the number of possible files as the number of possible fingerprints.

The reason that the pigeonhole principle doesnt render hash functions completely pointless is that there are an astounding number of possible fingerprintsfar more, in fact, than the number of files on the planet. (With MD5 there are 2^{128} possible fingerprints. Now, the total number of computer hard drives that have ever been manufactured is only around 2^{29}. If every hard drive had a million unique filesa gross overestimationthere would still be only 2^{49} individual files. Thats a much, much, much smaller number than 2^{128}.)

## 2 comments. Share your thoughts » 0 comments about this story. Start the discussion »