A Better Way to Shoot Down Spam

Junk mail can now be identified based on a single packet of data.

Rachel Kremen

July 29, 2009

New software developed at the Georgia Institute of Technology can identify spam before it hits the mail server. The system, known as SNARE (Spatio-temporal Network-level Automatic Reputation Engine), scores each incoming e-mail based on a variety of new criteria that can be gleaned from a single packet of data. The researchers involved say the automated system puts less of a strain on the network and minimizes the need for human intervention while achieving the same accuracy as traditional spam filters.

Separating spam from legitimate e-mail, also known as ham, isn’t easy. That’s partly because of the sheer volume of messages that need to be processed and partly because of e-mail expectations: users want their e-mail to arrive minutes, if not seconds, after it was sent. Analyzing the content of every e-mail might be a reliable method for identifying spam, but it takes too long, says Nick Feamster, an assistant professor at Georgia Tech who oversaw the SNARE research. Letting spam flow into our in-boxes unfiltered isn’t a sensible option, either. According to a report released by the e-mail security firm MessageLabs, spam accounted for 90.4 percent of all e-mail sent in June.

“If you’re not concerned about spam, I would suggest you turn off your spam filter for about an hour and see what happens,” says Sven Krasser, senior director of data-mining research at McAfee. The Santa Clara, CA, company provided raw data for analysis by the Georgia Tech team.

The team analyzed 25 million e-mails collected by TrustedSource.org, an online service developed by McAfee to collate data on trends in spam and malware. Using this data, the Georgia Tech researchers discovered several characteristics that could be gleaned from a single packet of data and used to efficiently identify junk mail. For example, their research revealed that ham tends to come from computers that have a lot of channels, or ports, open for communication. Bots, automated systems that are often used to send out reams of spam, tend to keep open only the e-mail port, known as the Simple Mail Transfer Protocol port.

Furthermore, the researchers found that by measuring the geodesic distance between the Internet Protocol (IP) addresses of the sender and receiver, that is, the distance between their mapped locations along the curved surface of the earth, they could determine whether the message was junk. (Much like every house has a street address, every computer on the Internet has an IP address, and that address can be mapped to a geographic area.) Spam, the researchers found, tends to travel farther than ham. Spammers also tend to have IP addresses that are numerically close to those of other spammers.
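The two measurements described above can be sketched in a few lines of Python. This is only an illustration, not SNARE's actual code: it assumes each IP address has already been mapped to a latitude and longitude by some geolocation database, it uses the haversine formula on a spherical earth as a stand-in for geodesic distance, and the function names are hypothetical.

```python
import ipaddress
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; treats the earth as a sphere (an approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def ipv4_numeric_gap(ip_a, ip_b):
    """Absolute difference between two IPv4 addresses read as 32-bit integers."""
    return abs(int(ipaddress.IPv4Address(ip_a)) - int(ipaddress.IPv4Address(ip_b)))
```

A large great-circle distance between sender and receiver, or a small numeric gap between the sender and known spamming addresses, would each count as one signal among several rather than a verdict on its own.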

Dean Malmgren, a PhD candidate at Northwestern University whose work includes developing new methods for identifying spam, says he finds the research interesting. But he wonders how robust SNARE will be once its methodology is widely known. IP addresses, he notes, are easy to fake. So, if spammers got wind of how SNARE works, they might, for example, use a fake IP address close to the recipient’s.

The Georgia Tech researchers also looked at the autonomous system (AS) number associated with an e-mail. (An AS number is assigned to every independently operated network, whether it’s an Internet service provider or a campus network.) Knowing that a significant percentage of spam comes from a handful of autonomous systems, the researchers decided to integrate that characteristic into SNARE, too.

The end result was a system capable of detecting spam 70 percent of the time, with a 0.3 percent false positive rate. Feamster says that’s comparable to existing spam filters but notes that when used in tandem with existing systems, the process should be far more efficient.

“Consider SNARE a first line of defense,” says Shuang Hao, a PhD candidate in computer science at the Georgia Institute of Technology and a SNARE researcher. Each of the characteristics in the SNARE system contributes to the overall score of an e-mail. So far SNARE has been implemented only in a research environment, but if used in a corporate setting, the network administrator could set rules about what happens to e-mail based on its SNARE score. For example, e-mail that scores poorly could be dropped before it even hits the mail server. Hao says this can save considerable resources, as many companies have a policy that requires they retain a copy of every e-mail that hits the server, whether or not it’s junk. Messages with mediocre scores could be further assessed by traditional content filters.
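The kind of rule an administrator might set can be sketched as follows. Everything specific here is an assumption made for illustration: the article does not say how SNARE scores are scaled, so the sketch treats a score as a number from 0 to 1 where higher means more likely legitimate, and the two thresholds and the `route_by_score` name are hypothetical.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop"                      # rejected before it reaches the mail server
    CONTENT_FILTER = "content_filter"  # handed to a traditional content filter
    DELIVER = "deliver"                # passed through


def route_by_score(score, drop_below=0.2, deliver_at_or_above=0.8):
    """Pick an action for a message from its reputation score.

    Assumes scores lie in [0, 1] with higher meaning more trustworthy;
    both thresholds are illustrative defaults, not values from the research.
    """
    if score < drop_below:
        return Action.DROP
    if score >= deliver_at_or_above:
        return Action.DELIVER
    return Action.CONTENT_FILTER
```

Under these assumed thresholds, `route_by_score(0.05)` drops the message, `route_by_score(0.5)` sends it on for content filtering, and `route_by_score(0.95)` delivers it.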

Hao is currently helping Yahoo improve its spam filter, based on what he’s learned developing SNARE. He says that Cisco has also expressed interest in the work.

“It is fairly clever in the way that they combine a bunch of data that’s cheap to use,” says John Levine, president of the Coalition Against Unsolicited Commercial Email and a senior technical advisor to the Messaging Anti-Abuse Working Group, a consortium of companies involved in fighting spam. “On the other hand, I think some of their conclusions are a bit too optimistic. Spammers are not dumb; any time you have a popular scheme [for identifying spam], they’ll circumvent it.”

The research team will present its work on SNARE at the Usenix Security Conference next month in Montreal. In the future, Feamster hopes to be able to apply the findings to other computer security problems, such as phishing e-mails, in which the sender pretends to be from a trusted institution to con recipients into divulging their passwords.
