Technology Review - Published By MIT
Advertisement

Fingerprinting Your Files

Continued from page 2

By Simson Garfinkel

8/04/2004

smaller text tool iconmedium text tool iconlarger text tool icon

How Hash Functions Work

So that's why hash functions are helpful. Now, let's see what they actually look like.

Among the most widely used hash functions today are the so-called MD5 (for Message Digest #5). MD5 produces a hash that is 128 bits long and that is commonly written as sequence of 32 hexadecimal (base 16) digits. If you were to take my name and process it with MD5, you would get this seemingly random string:

c55bbe0f3ba258f5b1cb6d5b62b0b360

Or, to state it with more mathematical formality:

MD5(Simson Garfinkel)= c55bbe0f3ba258f5b1cb6d5b62b0b360

Each of those hexadecimal characters represents 4 bits; the MD5 value of my name is actually:

1100010101011011101111100000111100111011101
0001001011000111101011011000111001011011011
010101101101100010101100001011001101100000

Most people work with the hexadecimal representation because its pretty easy to look two hashes and tell if they are the same or different.

MD5 works by splitting the file up into lots of small pieces, and then taking each of those chunks and performing hundreds of mathematical operations that shuffle, invert, transpose, and otherwise process the bits into an unrecognizable mess. The word unrecognizable in this description is key. The fundamental requirement of a good hash function is that it should be impossible to predict the fingerprint of a file without actually going to the effort of computing that fingerprint there must be no short-cuts. If there were, you might be able to run the hash function backwards and create a file that had a specific hashfor example, the hash of another file. Indeed, the entire security of hash functions falls apart utterly if it is possible to generate two files that have the same hash.

The beauty of the hash function is that even a tiny modification to the input produces a dramatic change in the output. Mathematically, the functions are designed so that every bit in the output will have a 50 percent chance of changing for every single bit changed in the input.

Lets look at another MD5 hash, this one of a slightly different representation of my name:

MD5(Simson L. Garfinkel)= df876e8e6f548d5be698fab7f06dd278

Merely adding "L." produces a completely different hash. If you compare the two hashes bit-for-bit youll find that 63 out of the 128 positions have changed from a 0-to-1 or a 1-to-0, and the other 65 have remained unchanged.

Unfortunately, the whole theory of cryptographic hash functions has a huge problem. The use of these functions requires that there be no so-called "collisions". Either accidentally or on purpose, there should be no two files that have the same cryptographic fingerprint. And as it turns out, this is an impossible requirement.

The reason is pretty simple. File fingerprints are a fixed size, which means that there is a finite number of possible fingerprints. Files, on the other hand, can be any size. Thus, there are more possible files than fingerprints, and so there must be at least one fingerprint that is the fingerprint of multiple files. The mathematical term for this is the pigeonhole principle. Indeed, even if you restrict yourself to files that are just nine characters long, there are still 256 times the number of possible files as the number of possible fingerprints.

The reason that the pigeonhole principle doesnt render hash functions completely pointless is that there are an astounding number of possible fingerprintsfar more, in fact, than the number of files on the planet. (With MD5 there are 2128 possible fingerprints. Now, the total number of computer hard drives that have ever been manufactured is only around 229. If every hard drive had a million unique filesa gross overestimationthere would still be only 249 individual files. Thats a much, much, much smaller number than 2128.)

Comments

  • file
    a javascript file md5 generator?
    Rate this comment: 12345
    Guest (8cf14c1ca9280af0e8525011007c2404)
    12/30/2005
    Posts:1
  • file
    a javascript file md5 generator?
    Rate this comment: 12345
    Guest (8cf14c1ca9280af0e8525011007c2404)
    12/30/2005
    Posts:1

Log In

Forgot your password?     Register »
Advertisement

Videos

Laser-Triggered Chemical Reactions
Featured Content
Sponsored by:
White Papers

Twelve ways to reduce costs with SQL Server 2008
Find out how to reduce costs and get more efficient

Download

Total Economic Impact of SQL Server 2008 Upgrade
Forrester reports on increasing productivity and management capabilities

Download 

Achieving Cost and Resource Savings with UC
How Office Communications Server R2 and Exchange Server can make your business smarter and more efficient

Download 

The Compelling Case for Conferencing
Read how you can improve workload support and find IT efficiencies

Download

How Windows Server 2008 R2 Helps Optimize IT and Save you Money
Read how you can improve workload support and find IT efficiencies

Download

Windows Server 2008 R2 Hyper-V Live Migration
See how Windows Server 2008 R2 and Hyper-V enable virtualization and Live Migration

Download
Advertisement
Subscribe to Technology Review's daily e-mail update. Enter your e-mail address

TECHNOLOGY RESOURCES
Advertisement
MIT Massachusetts Institute of Technology © 2009 Technology Review. All Rights Reserved.