Seeking a more perfect form of relief, tens of thousands of users have downloaded open-source filters (most popularly, Spam Assassin) or purchased commercialized versions such as McAfee’s SpamKiller. Built on collections of statistically valid rules created by humans, these “heuristic” filters stand guard at the user’s in-box and scan every incoming message for tip-off terms such as “Viagra,” “V1AGRA,” or even “V*I*A*G*R*A,” as well as improbable return addresses, strange symbols, embedded graphics, and fraudulent routing information, all of which suggest that a message is of dubious origin. After applying hundreds of rules, the filter scores each message and discards those whose scores exceed a threshold value. Spam Assassin and SpamKiller typically exhibit filtration rates higher than 95 percent and false-positive rates of about 0.1 percent, according to Matt Sergeant of MessageLabs, a maker of Spam Assassin improvements.
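To make the scoring idea concrete, here is a minimal sketch of a rule-based filter in Python. The four rules, their weights, and the threshold are invented for illustration; they are not taken from Spam Assassin or SpamKiller, which apply hundreds of tuned rules.

```python
import re

# Hypothetical rules: (name, pattern, weight). All values are illustrative.
RULES = [
    # "Viagra", "V1AGRA", "V*I*A*G*R*A": letters split by optional symbols.
    ("viagra_variant",
     re.compile(r"v[\W_]*[i1!][\W_]*a[\W_]*g[\W_]*r[\W_]*a", re.IGNORECASE), 3.0),
    # Improbable return address: a long run of digits in the sender name.
    ("numeric_sender",
     re.compile(r"^From:\s*\S*\d{5,}\S*@", re.IGNORECASE | re.MULTILINE), 1.5),
    # Strange symbols: three or more in a row.
    ("symbol_run", re.compile(r"[!$#%]{3,}"), 1.0),
    # Embedded graphics in an HTML body.
    ("embedded_image", re.compile(r"<img\b", re.IGNORECASE), 0.5),
]

THRESHOLD = 4.0  # assumed cutoff; real filters let the user tune this


def score_message(raw):
    """Add up the weights of every rule whose pattern appears in the message."""
    return sum(weight for _, pattern, weight in RULES if pattern.search(raw))


def is_spam(raw):
    """Discard the message only if its score exceeds the threshold."""
    return score_message(raw) > THRESHOLD
```

Raising the threshold trades a lower false-positive rate for more spam getting through, which is the tension the next paragraph describes.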
This relatively high false-positive rate, however, is troubling to some users. After all, much legitimate e-mail has some of the same traits as spam. Sergeant concedes that newsletters that were requested by users will occasionally be discarded. That flaw has led to novel solutions such as collaborative filters, in which users vote as to which messages should be deemed spam.
SpamNet, from San Francisco-based Cloudmark, is one example of a program that deploys democracy in this way. An add-on to Microsoft’s Outlook e-mail program, SpamNet starts filtering spam automatically upon installation. If enough trusted users designate a message as spam, that message ends up in the spam folders of Cloudmark’s entire base of 420,000 users. “When a new person joins, they get the benefit of the community,” says Vipul Ved Prakash, Cloudmark’s founder and chief scientist. False positives are rarer under this approach, and users also have the option of clicking “unblock” on any messages in their spam folders. But there are drawbacks: SpamNet demands a higher level of user vigilance, and it requires that Cloudmark’s remote servers examine all incoming e-mail before passing it on.
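The voting mechanism can be sketched roughly as follows. This is not Cloudmark's implementation: the class name, the exact-match SHA-256 fingerprint, and the fixed report count are all assumptions made for illustration, and a production system would need fingerprints that tolerate small variations between copies of the same spam.

```python
import hashlib


class CollaborativeFilter:
    """Toy community filter: trusted users' reports apply to everyone."""

    def __init__(self, min_trusted_reports=3):
        self.min_trusted_reports = min_trusted_reports  # assumed cutoff
        self.trusted_users = set()
        self.reports = {}    # fingerprint -> set of trusted reporters
        self.unblocked = {}  # user -> set of fingerprints that user restored

    @staticmethod
    def fingerprint(body):
        # Simplification: hash the case-folded, whitespace-normalized text.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, user, body):
        # Votes from users who are not trusted are ignored.
        if user in self.trusted_users:
            self.reports.setdefault(self.fingerprint(body), set()).add(user)

    def unblock(self, user, body):
        # A user can override the community verdict for their own mailbox.
        self.unblocked.setdefault(user, set()).add(self.fingerprint(body))

    def is_spam_for(self, user, body):
        fp = self.fingerprint(body)
        if fp in self.unblocked.get(user, ()):
            return False
        return len(self.reports.get(fp, ())) >= self.min_trusted_reports
```

A newcomer who has never reported anything still gets a verdict from `is_spam_for`, which is the "benefit of the community" Prakash describes.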
To fend off spam that penetrates other defenses, computer scientists have turned to the 18th-century probability theory of English mathematician Thomas Bayes. Published in 1763, two years after his death, Bayes’s “Essay towards Solving a Problem in the Doctrine of Chances” provides a blueprint for determining the likelihood of future events. Since one person’s spam can be another person’s invitation to a pleasurable afternoon, Bayesian spam filters learn over time what each individual considers unwanted e-mail. When a user deletes several unopened messages about mortgage refinancing, for instance, a Bayesian filter learns to discard e-mail with that kind of terminology. If you typically do read such messages, however, the filter will take note of that and consider it normal e-mail.
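A stripped-down version of such a trainable filter might look like the sketch below. It is a simplified naive Bayes classifier with add-one smoothing, written for illustration; the class name, tokenizer, and smoothing constants are assumptions, not the design of any product mentioned here.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Per-user filter that learns from messages the user labels."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        """label is 'spam' (deleted unread) or 'ham' (mail the user wants)."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Start from smoothed prior odds, then add evidence word by word.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for word in self.tokenize(text):
            p_word_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_word_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_word_spam / p_word_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Because every user trains a separate instance, a filter trained by someone who reads mortgage offers will rate the word "mortgage" as evidence of legitimate mail, while a neighbor's filter learns the opposite.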
Because Bayesian filters can be trained, their effectiveness improves over time, typically attaining filtration rates of 99.8 percent, along with a false-positive rate of a mere 0.05 percent. “If everyone’s filter has different probabilities of different messages getting through, it makes it harder for the spammers,” says Paul Graham, an independent Cambridge, MA, programmer. Last August, a link to Graham’s article “A Plan for Spam” on slashdot.org jump-started a rush to Bayesian filtering. These kinds of filters, Graham says, will break the business model of the spammer. It costs about $200, he continues, to send one million messages, an endeavor that typically yields about 100 responses. If those 100 people spend an average of $2 each, the spammer breaks even. The goal, Graham says, is to drive response rates down to around one in a million so that “it would no longer be economical for a spammer to consider such a business proposition.”
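Graham's arithmetic can be checked with a few lines of Python. The function and parameter names below are invented for this sketch; the dollar figures and response counts are the ones he cites.

```python
def campaign_profit(messages, cost_per_million, responses_per_million,
                    revenue_per_response):
    """Spammer's profit in dollars for one mailing."""
    millions = messages / 1_000_000
    cost = millions * cost_per_million
    revenue = millions * responses_per_million * revenue_per_response
    return revenue - cost


# Today: $200 per million messages, 100 responses, $2 per response.
break_even = campaign_profit(1_000_000, 200, 100, 2)

# Graham's goal: one response per million messages.
filtered = campaign_profit(1_000_000, 200, 1, 2)
```

With the cited figures the first mailing nets nothing, and at one response per million the same mailing loses $198, which is why driving down the response rate attacks the business model itself.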
Microsoft Research has taken this probabilistic approach even further. Standard, so-called naive Bayesian filters treat each word or feature in an e-mail independently, but Microsoft claims its new filter, which is offered as an option in MSN 8 software, learns probabilities for words, phrases, and other distinguishing characteristics that commonly appear together. It might flag messages containing the phrases “make money from home” and “click here” that are sent from servers based in Hong Kong and that have random characters in the subject line. Microsoft’s Heckerman claims that, by correlating patterns, his filter exhibits an even lower rate of false positives.
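Microsoft's method is not described in detail here, so the following is only one simple way to let a filter see phrases rather than isolated words: treat every run of up to four consecutive words as a feature in its own right. The function name and the four-word limit are assumptions made for this sketch.

```python
import re


def extract_features(text, max_words=4):
    """Return single words plus every phrase of up to max_words words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    features = set()
    for n in range(1, max_words + 1):
        for start in range(len(words) - n + 1):
            features.add(" ".join(words[start:start + n]))
    return features
```

A probabilistic filter fed these features can assign "make money from home" its own probability, instead of inferring it from "make," "money," "from," and "home" as if they were unrelated.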
The monkey wrench is that spam is not an inanimate adversary, but rather a tool of wily and willful humans. In fact, the very effectiveness of spam filters may actually be making the problem worse. If half of a batch of spam gets thrown into the digital garbage can, the spammer will tend to respond by sending twice as much spam the next time. “As you put more filters in place, spammers become more determined, and the spam will increase,” says the Anti Spam Research Group’s Judge, who is the chief technology officer at CipherTrust, an Alpharetta, GA-based provider of e-mail security systems.
To balance the higher volume, Judge says, spammers simply find ways to lower their costs, such as enlisting servers based in China or India, where labor is cheap. What’s more, as spammers mount a counterattack against Bayesian methods, spam is tending to look more and more like non-spam. For example, a message that says, “Hi Jim, have you seen the party pictures - take a look!” may not raise red flags, because it doesn’t contain any obvious spam terms. When spam begins to look exactly like messages from friends and colleagues, filters may fail.