Catching Spammers in the Act

Researchers show how spammers harvest e-mail addresses and send out bulk messages.

Robert Lemos

July 15, 2009

Researchers have shed new light on the methods by which spammers harvest e-mail addresses from the Web and relay bulk messages through multiple computers. They say that the findings could provide additional ammunition in the fight against junk e-mail campaigns.

The problem of unwanted e-mail messages, or spam, continues to vex computer users and security professionals. More than 90 percent of the e-mail messages traversing the Internet appear to be spam, according to information released in June by the e-mail security firm MessageLabs.

In one paper scheduled to be presented this week at the Conference on E-mail and Anti-Spam, in Mountain View, CA, researchers from Indiana University studied how spammers obtain the e-mail addresses in the first place. The researchers used a variety of techniques to match the programs that cull e-mail addresses from Web pages to the resulting spam. “We are basically trying to figure out how spammers get your address–the addresses of people that they try to victimize,” says Craig Shue, a graduate student at Indiana University who now works at Oak Ridge National Laboratory.

This involved exposing 22,230 unique e-mail addresses on the Web over a five-month period and watching for spam sent to those destinations. The researchers found that an e-mail address included in a comment posted to a website was far more likely to attract spam than one submitted during registration. Only four of the e-mail addresses submitted to 70 websites during registration resulted in spam, whereas half of the e-mail addresses posted to popular sites did.

The researchers also set up a website on their own domain and waited for their pages to be crawled. Each visitor to the website would see a different e-mail address, a strategy that the researchers hoped would gauge how often programs that automatically crawl sites are operated by spammers. “We are giving out a unique e-mail address to every visitor to our webpage,” Shue says. “If we ever get an e-mail to that address, we know that the crawler gave that e-mail address to a spammer.”
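
The per-visitor address idea can be illustrated with a short Python sketch. It is not the researchers' code: the `AddressTracker` class, its method names, and the `example.org` domain are all hypothetical, chosen only to show how a unique address can later be traced back to the visit that received it.

```python
import secrets


class AddressTracker:
    """Hands out one unique address per visit and remembers who received it.

    Hypothetical helper for illustration; not taken from the study.
    """

    def __init__(self, domain):
        self.domain = domain.lower()
        self.issued = {}  # address -> details of the visit that saw it

    def issue(self, visitor_ip, user_agent):
        # A random token makes each address unguessable and unique in practice.
        token = secrets.token_hex(8)
        address = f"{token}@{self.domain}"
        self.issued[address] = {"ip": visitor_ip, "user_agent": user_agent}
        return address

    def trace(self, recipient):
        # Returns the visit that was shown this address, or None if unknown.
        return self.issued.get(recipient.lower())


tracker = AddressTracker("example.org")
shown = tracker.issue("192.0.2.10", "ExampleBot/1.0")
print(tracker.trace(shown))
```

If spam later arrives at an address issued this way, looking it up identifies the single page visit that could have harvested it.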

The researchers also found that the programs that crawl the Web looking for e-mail addresses–dubbed spamming crawlers–have characteristics that could make it easier to detect them. For example, the parts of a network from which a crawler operates tend to be a good predictor of whether it is a legitimate crawler, such as those used by Google or other search engines, or a spamming crawler. “It may be feasible to block a small number of [network numbers] associated with spammer Web crawlers to eliminate the harvesting of e-mail addresses on a site,” the Indiana University researchers wrote.
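
As a rough illustration of blocking by network, the Python sketch below checks a visitor's IP address against a small list of address ranges. It assumes the "network numbers" in the quote can be expressed as CIDR ranges, which the article does not confirm. The ranges shown are reserved documentation addresses, not real crawler networks, and `BLOCKED_NETWORKS` and `is_blocked` are hypothetical names.

```python
import ipaddress

# Hypothetical blocklist. These are documentation-only ranges (RFC 5737),
# standing in for networks a site operator has linked to spamming crawlers.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


print(is_blocked("198.51.100.7"))  # inside the first blocked range
print(is_blocked("192.0.2.1"))     # outside both ranges
```

A site could run a check like this before serving pages that contain e-mail addresses, though which networks to list would have to come from the operator's own measurements.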

Many end users protect themselves against e-mail harvesting using simple obfuscation techniques–for example, using “-at-” to replace the “@” symbol in an e-mail address. The researchers found that these methods frustrate current spam techniques surprisingly well. In addition, they found that submitting an e-mail address to a legitimate website rarely resulted in spam. “If you sign up with reputable organizations, you will be fine,” Shue says. “If you go to less reputable sites, then you will get spam.”
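
The “-at-” substitution is simple enough to sketch in Python. The harvester below is an assumption: a naive regular-expression matcher written for this illustration, not a description of any real spamming crawler. It shows only why an address without an “@” symbol slips past a pattern that expects one.

```python
import re

# Naive pattern of the kind a simple harvester might use (an assumption).
NAIVE_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def obfuscate(address):
    """Replace the "@" symbol with "-at-", as described in the article."""
    return address.replace("@", "-at-")


def harvest(text):
    """Return every substring that looks like a plain e-mail address."""
    return NAIVE_PATTERN.findall(text)


plain = "Contact alice@example.org today"
hidden = "Contact " + obfuscate("alice@example.org") + " today"
print(harvest(plain))
print(harvest(hidden))
```

A harvester written to recognize “-at-” would defeat this particular trick, so the sketch says nothing about how long such obfuscation will keep working.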

In a separate paper to be presented at the same conference, researchers from the Federal University of Minas Gerais (UFMG), in Brazil, and Brazil’s Network Information Center show that spammers tend to combine different techniques to hide the origin of their junk e-mail messages. While many spam groups have adopted the use of botnets to anonymize the source of their e-mail messages, a significant number use a chain of different compromised machines, according to Pedro Calais Guerra, a PhD student at UFMG.

“The key factor for a spammer to succeed in terms of hiding his identity on the network is to spread his activity as much as he can,” says Guerra, who believes that the team’s study could be used to help fight spam by identifying which messages should be blocked. “We think it may have an impact on the design of blacklists.”

Guerra and five other researchers monitored special servers, known as honeypots, collecting 525 million spam e-mail messages sent from more than 216,000 Internet addresses over a 15-month period. They found, for example, that nearly 95,000 of the machines used by spammers were end-user computers relaying messages rather than mail servers; a third of those machines were in the United States and a quarter in Taiwan.

The chains of computers used by the spammers to anonymize the origins of spam fell into two categories: open proxies and open relays. The open proxies are compromised servers that forward data to other computers on the network, hiding the sender’s address; open relays receive e-mail messages for another domain, passing the message to the next server. The researchers found that spammers typically use each open relay to forward e-mail for only a short time, to avoid having the e-mail server added to a blacklist.

“We show in our paper that spammers send high volumes of spam to open proxies but low volumes of spam to open relays,” UFMG’s Guerra says.
