One challenge in devising a distribution system that can locate similar files is that the system must search not just for each file but also for every chunk within that file. A 700-megabyte video clip may be divided into 40,000 chunks, which means that the system must make several billion comparisons. SET is a hybrid system that first locates users with identical files before searching for requested chunks in file variants. SET’s innovation in the latter task is what the researchers call handprinting, which efficiently identifies similar files using a constant number of search queries regardless of the file size. SET divides the requested file into 16-kilobyte chunks, each of which is then distilled into a 160-bit hash, or fingerprint. These fingerprints are sorted by numeric value, and the system selects the first few to form the handprint. Comparing handprints, says Andersen, “gives you a 90 percent chance of discovering a file that is 10 percent or more similar.”
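The handprinting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the use of SHA-1 (chosen only because it yields 160-bit digests), the fixed-size chunk boundaries, the handprint size of 30, and the function names `chunk_hashes`, `handprint`, and `may_be_similar` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 16 * 1024   # 16-kilobyte chunks, as described in the article
HANDPRINT_SIZE = 30      # assumption: the article says only "the first few"


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks and fingerprint each one.

    SHA-1 is an assumption; it is used here because its digest is 160 bits.
    """
    return [
        hashlib.sha1(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def handprint(data: bytes, k: int = HANDPRINT_SIZE) -> set[bytes]:
    """Return the k numerically smallest distinct fingerprints.

    Equal-length digests sort as big-endian integers under bytes ordering,
    so a plain sort gives the numeric order.
    """
    distinct = sorted(set(chunk_hashes(data)))
    return set(distinct[:k])


def may_be_similar(a: bytes, b: bytes, k: int = HANDPRINT_SIZE) -> bool:
    """Two files are candidates for sharing chunks if their handprints overlap."""
    return not handprint(a, k).isdisjoint(handprint(b, k))
```

Because a handprint holds at most `k` fingerprints no matter how large the file is, a lookup needs only a fixed number of queries. Two files that share many chunks are also likely to share some of their smallest fingerprints, which is why overlapping handprints are a useful hint of similarity rather than a guarantee.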
Locating a file with just 10 percent similarity could speed up downloads by 8 percent. For music files with greater than 90 percent similarity, a five-minute download on BitTorrent would take just over two minutes with SET. For a single user, the savings could be even greater if he or she happens to be downloading an unpopular variant of a common file. Andersen proposes a scenario in which a U.S.-based user downloads a German version of a popular movie. Currently, the movie would most likely be transferred over a slower overseas connection. But with SET, users could take advantage of faster local sources for video and receive only the audio from German peers.
“It’s a very clever scheme for finding the chunks in common,” says Sirer. However, he says that “for the most popular content, [SET] won’t make too much of a difference because there are already plenty of other peers who host that content. But I can imagine that other content which would otherwise be slow to get from a single swarm might actually be easier to download.”
Although the researchers have released the source code for the SET system, they have no plans to build a graphical user interface for it or to deploy it in current file-sharing networks. “The math behind it was complex to analyze,” Andersen says, “but the idea is relatively straightforward, and the implementation won’t be bad.” He says he wouldn’t be surprised if someone deployed the SET system in the next year.