Let’s face it: peer-to-peer file transfers on the Internet are slow. More than half of all downloads fail, and the average transfer time for a 100-megabyte file is more than 24 hours. But now, a team of computer scientists led by Himabindu Pucha at Purdue University, in Indiana, say that they can double the speed of these transfers by taking advantage of overlap in data chunks contained within nonidentical multimedia files posted on peer-to-peer distribution networks. This would improve the likelihood of success of these transfers.
Peer-to-peer distribution networks such as BitTorrent and Kazaa allow people to download individual files from others’ computers. These systems first locate the copies of the requested file in the network’s global lookup table using its “hash”–a unique identifier computed from the file’s data sequence. Then, the file is divided into chunks so that each user’s computer only has to upload a small piece of it. This technique speeds up file transfers because home users typically have greater bandwidth allocated to downloads compared with uploads. Of course, the overall speed of the transfer will depend on the number of file sources and how much spare upload capacity they have. The more popular a file is, the faster it is to download and the greater the chance of success.
Computer scientist David Andersen, a professor of computer science at Carnegie Mellon University, worked with the Purdue group to develop a way to increase the size of the pool of uploaders called similarity-enhanced transfer (SET). The approach takes advantage of multiple variants of the same music files, video clips, and software, which are often floating around file-distribution networks. “We hope that SET gives you access to a larger pool of people to download from,” says Andersen. “And by doing so, we think you’re more likely to find one of these people who have more spare capacity.”
Before Andersen and his colleagues conducted their study, it was not at all clear how much redundancy existed in file-sharing networks and whether it could be exploited, says Cornell University computer scientist Emin Gün Sirer, who was not involved in the study. The SET team analyzed almost two terabytes of music and video files from file-sharing networks, and it discovered that similar files typically shared anywhere between 20 and 99 percent of their content. With music files, even misspellings in user-defined header labels that identify artist and song titles are enough to throw off BitTorrent, despite the fact that 99 percent of the file is the same. Similarly, multiple versions of the same video are often available with different language tracks.
One challenge in devising a distribution system that can locate similar files is that the system must search not just for each file but also for every chunk within that file. A 700-megabyte video clip may be divided into 40,000 chunks, which means that the system must make several billion comparisons. SET is a hybrid system that first locates users with identical files before searching for requested chunks in file variants. SET’s innovation in the latter task is what the researchers call handprinting, which efficiently identifies similar files using a constant number of search queries regardless of the file size. SET divides the requested file into 16-kilobyte chunks, which are then distilled into 160-bit-chunk hashes, or fingerprints. These fingerprints are sorted based on their numeric value, and the system selects the first few to form the handprint. Comparing handprints, says Andersen, “gives you a 90 percent chance of discovering a file that is 10 percent or more similar.”
Locating that file with just 10 percent similarity could speed up downloads by 8 percent. For music files with greater than 90 percent similarity, a five-minute download on BitTorrent would take just over two minutes with SET. For a single user, the savings could be even greater if he or she happens to be downloading an unpopular variant of a common file. Andersen proposes a scenario in which a U.S.-based user downloads a German version of a popular movie. Currently, the movie would most likely be transferred from a slower overseas connection. But with SEC, users could take advantage of faster local sources for video and receive only the audio from German peers.
“It’s a very clever scheme for finding the chunks in common,” says Sirer. However, he says that “for the most popular content, [SET] won’t make too much of a difference because there are already plenty of other peers who host that content. But I can imagine that other content which would otherwise be slow to get from a single swarm might actually be easier to download.”
Although the researchers have released the source code for the SET system, they have no plans to build a graphical user interface for it or to deploy it in current file-sharing networks. “The math behind it was complex to analyze,” Andersen says, “but the idea is relatively straightforward, and the implementation won’t be bad.” He says he wouldn’t be surprised if someone deployed the SET system in the next year.