Microsoft Team Traces Malicious Users

Three researchers find a way to trace compromised machines used to attack other computers.

Robert Lemos

August 13, 2009

Anonymity on the Internet can be both a blessing and a curse. While the ability to hide behind anonymous proxies and fast-changing Internet protocol (IP) addresses has enabled freer speech in nations with repressive regimes, the same technologies allow cybercriminals to hide their tracks and pass off malicious code and spam for legitimate communications.

In a paper to be presented next week at SIGCOMM 2009 in Barcelona, Spain, three researchers from Microsoft’s research center in Mountain View, CA, demonstrate a way to remove the shield of anonymity from such shadowy attackers. Using a new software tool, the three computer scientists were able to identify the machines responsible for malicious activity, even when the host’s IP address changed frequently.

“What we are really trying to get at is the host responsible for an attack,” says Yinglian Xie, a member of the Microsoft team. “We are not trying to track those identifiers but associate them with a particular host.”

The prototype system, dubbed HostTracker, could result in better defenses against online attacks and spam campaigns. Security firms could, for example, build a better picture of which Internet hosts should be blocked from sending traffic to their clients, and cybercriminals would have a harder time camouflaging their activities as legitimate traffic.

Xie and her colleagues, Fang Yu and Martin Abadi, analyzed a month’s worth of data–330 gigabytes–collected from a large e-mail service provider, in an attempt to determine which users were responsible for sending out spam. To trace the origins of multiple spam outbreaks, the scientists studied records including more than 550 million user IDs, 220 million IP addresses, and a time stamp for events such as sending a message or logging into an account.

Tracing the origins of messages, a key task for tracking spam and other kinds of Internet attack, involved reconstructing relationships between account IDs and the hosts from which users connected to the e-mail service. To do this, the researchers grouped together all the IDs accessed from different hosts over a certain time period. The HostTracker software then combed through this data to resolve any conflicts. For example, sometimes more than one user appeared to originate from the same IP address, or a single user had multiple IP addresses during overlapping periods of time.

HostTracker resolves the conflicts by cross-referencing the data to identify proxy servers, which allow several hosts to appear as a single IP address, and to determine when a guest was using a legitimate host. “The fact that we are able to trace malicious traffic to the proxy itself is an improvement because we are able to pinpoint the exact origin,” Xie says.
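To make the two kinds of conflict concrete, here is a minimal Python sketch. It is not the HostTracker algorithm, whose details the researchers describe in their paper. It only illustrates the general idea of grouping login events by user and IP address and then flagging overlaps. The names `Event`, `build_bindings`, `find_conflicts`, and `likely_proxies`, as well as the `min_users` threshold, are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    user_id: str
    ip: str
    timestamp: int  # seconds since an arbitrary start (assumed unit)


def build_bindings(events):
    """Collapse events into one (first_seen, last_seen) window per (user, IP) pair."""
    windows = {}
    for e in events:
        key = (e.user_id, e.ip)
        first, last = windows.get(key, (e.timestamp, e.timestamp))
        windows[key] = (min(first, e.timestamp), max(last, e.timestamp))
    return windows


def overlaps(a, b):
    """True when two closed time windows share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(windows):
    """Return (shared_ips, roaming_users).

    shared_ips: IPs where two or more users were active in overlapping windows.
    roaming_users: users seen on two or more IPs in overlapping windows.
    """
    by_ip = defaultdict(list)
    by_user = defaultdict(list)
    for (user, ip), window in windows.items():
        by_ip[ip].append(window)
        by_user[user].append(window)

    def keys_with_overlap(groups):
        found = set()
        for key, entries in groups.items():
            for i in range(len(entries)):
                for j in range(i + 1, len(entries)):
                    if overlaps(entries[i], entries[j]):
                        found.add(key)
        return found

    return keys_with_overlap(by_ip), keys_with_overlap(by_user)


def likely_proxies(windows, shared_ips, min_users=3):
    """Guess that a shared IP is a proxy if enough distinct users appear on it.

    The threshold is an illustrative assumption, not a value from the research.
    """
    users_per_ip = defaultdict(set)
    for user, ip in windows:
        users_per_ip[ip].add(user)
    return {ip for ip in shared_ips if len(users_per_ip[ip]) >= min_users}
```

In this sketch, an IP address shared by many users at once is treated as a candidate proxy, while a user who appears on several addresses at the same time is simply flagged for closer inspection. A real system would need far more evidence than a single count before drawing either conclusion.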

The researchers also created a way to automatically blacklist traffic from a particular IP address, once the HostTracker system has determined that the host at that address is compromised. Using this method in simulation, the researchers were able to block malicious traffic with an error rate of five percent–in other words, 5 out of 100 IP addresses classified as malicious were actually legitimate. Using additional information to identify good user behavior reduced that false-positive rate to less than one percent.
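The error rate quoted above is a simple ratio: of the addresses flagged as malicious, what share turned out to be legitimate. A short Python sketch of that arithmetic follows. The function name `false_positive_share` and the sample addresses are made up for illustration and are not taken from the researchers' data.

```python
def false_positive_share(flagged, legitimate):
    """Fraction of flagged IP addresses that are actually legitimate."""
    if not flagged:
        return 0.0
    return len(flagged & legitimate) / len(flagged)


# Illustrative data only: 100 flagged addresses, 5 of them legitimate.
flagged = {f"192.0.2.{i}" for i in range(100)}
legitimate = {f"192.0.2.{i}" for i in range(5)}
share = false_positive_share(flagged, legitimate)
print(f"{share:.0%} of flagged addresses were legitimate")
```

With these made-up sets, 5 of the 100 flagged addresses are legitimate, which matches the five percent figure described in the article.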

The results suggest that HostTracker could refine current defenses against distributed denial-of-service (DDoS) attacks and spam campaigns, says Gunter Ollmann, vice president of research and development at Damballa, a firm that helps companies find and eliminate compromised hosts in a computer network.

“Using this technique will help find botnets that have a high frequency of traffic, such as spam campaigns, DDoS attacks, and maybe click-through attacks,” Ollmann says. “Other attacks, such as password-stealing and banking trojans, where the attack is more host-centric–this sort of technique would not be as effective.”

Xie acknowledges that while the technique is useful for creating lists of hosts to track, it may be less useful for law enforcement agencies attempting to identify the attackers behind online crime. “The accountability we are talking about is not court accountability,” she says. “We want to separate the two notions. The accountability that we are talking about is the ability to identify the hosts.”
