A Joined-Up Bot-Fighting Strategy
Can simulated handwriting stop the spambots from getting through?
In the battle to beat the spambots, a new weapon has been developed that exploits the difficulty that computers have with recognizing joined-up handwriting. The hope is that switching from text-based verification systems to systems that use computer-generated handwriting will make many Web services more secure.
Developed by researchers at the State University of New York (SUNY), in Buffalo, the system is a variant of a commonly used challenge-response technique called a CAPTCHA (completely automated public Turing test to tell computers and humans apart). This kind of test is designed to be easy for humans but nearly impossible for machines to pass, so that automated programs cannot create new accounts for nefarious purposes like sending out spam.
Most CAPTCHAs work by displaying images of randomly generated text that has been distorted to make it difficult for optical character recognition (OCR) programs to read, without making it illegible to humans. To pass the test and gain access, users simply reenter the text that they have read.
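The challenge-response loop described above can be sketched in a few lines of Python. This is only an illustration: the function names, the six-character length, and the case-insensitive comparison are assumptions made for the example, not details of any system mentioned in this article, and the step that renders the text as a distorted image is left out.

```python
import hmac
import secrets
import string

# Hypothetical alphabet for the example: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def new_challenge(length=6):
    """Return a random text challenge.

    A real CAPTCHA would render this text as a distorted image and keep
    the plain text on the server only.
    """
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def verify(expected, typed):
    """Check the user's answer, ignoring case and surrounding whitespace."""
    return hmac.compare_digest(
        expected.upper().encode("utf-8"),
        typed.strip().upper().encode("utf-8"),
    )


challenge = new_challenge()
print(verify(challenge, challenge.lower()))  # a correct answer passes
```

The security of the scheme rests entirely on the rendering step that is omitted here: if software can read the image, it can pass `verify` as easily as a person can.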
The trouble is that OCR software is improving steadily, making it possible for spambots to sometimes pass these tests. “It’s an arms race,” says Achint Oommen Thomas, one of the computer scientists who developed the new system. “Every CAPTCHA that exists today has already been broken.”
Just last year, a character-based CAPTCHA developed by Microsoft and used widely for services like Hotmail, MSN, and Windows Live was broken by Jeff Yan and his colleagues at Newcastle University, in the U.K. Microsoft had previously claimed that the CAPTCHA would only let one in 10,000 machine attempts through, but Yan was able to demonstrate that his attack succeeded 60 percent of the time.
Microsoft has since enacted improvements that have made the service much more secure. Even so, Oommen Thomas believes that automatically generating joined-up handwriting could further raise the bar. His system, developed with colleagues Amalia Rusu and Venu Govindaraju, generates words by selecting handwritten characters from a public database of 20,000 handwritten characters. Algorithms then identify important control points within the characters (the key loops and arches that make the letters and numbers recognizable) before other algorithms distort the characters and link them so that they appear joined up. “We distort them randomly but make sure that they are within set limits; otherwise, they become illegible to humans,” says Oommen Thomas.
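The idea of distorting randomly but within set limits can be illustrated with a short Python sketch. It is not the researchers' algorithm, which the article does not describe in detail: the representation of control points as (x, y) pairs, the function name, and the `max_shift` limit are all assumptions made for the example.

```python
import random


def distort_control_points(points, max_shift=2.0, rng=None):
    """Move each (x, y) control point by a random offset.

    The offset on each axis is drawn uniformly from
    [-max_shift, +max_shift], so no point moves farther than the set
    limit in either direction. `max_shift` is a hypothetical parameter.
    """
    rng = rng or random.Random()
    return [
        (
            x + rng.uniform(-max_shift, max_shift),
            y + rng.uniform(-max_shift, max_shift),
        )
        for x, y in points
    ]


# Made-up control points for a single character outline.
outline = [(0.0, 0.0), (10.0, 5.0), (3.5, -2.0)]
print(distort_control_points(outline, max_shift=1.5))
```

Raising `max_shift` in this sketch makes the shapes harder for software to match, but also harder for people to read, which is the trade-off Oommen Thomas describes.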
Publishing their results in the latest issue of the journal Pattern Recognition, the researchers show that some of the best OCR programs can recognize the characters less than 1 percent of the time. “Before a computer can try to recognize a character, it first has to locate it,” Oommen Thomas says, so having characters joined together should make this process (known as segmentation) more challenging.
However, Yan worries that such handwriting could also be much harder for humans to read. “My main concern is usability,” he says. Currently, the system has a human success rate of 75 percent, meaning that one time in four, a human can’t read the text. “That’s way too low,” says Yan.
Luis von Ahn, a computer scientist at Carnegie Mellon University, in Pittsburgh, and a member of the team that coined the term CAPTCHA, agrees. Von Ahn’s latest system, called reCAPTCHA, has a human success rate of 96 percent. “And still people complain,” he admits.
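To see what the gap between 75 percent and 96 percent means for a user, here is a rough back-of-the-envelope calculation in Python. It assumes that each attempt succeeds independently with the same probability, which is a simplification introduced for this example; the function names are likewise illustrative.

```python
def expected_attempts(success_rate):
    """Average number of tries until the first success.

    Assumes independent attempts with a fixed success rate, so the
    number of tries follows a geometric distribution with mean 1/p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def chance_all_fail(success_rate, tries):
    """Probability that every one of `tries` independent attempts fails."""
    return (1 - success_rate) ** tries


# Human success rates quoted in the article.
for rate in (0.75, 0.96):
    print(rate, expected_attempts(rate), chance_all_fail(rate, 2))
```

Under that assumption, a 75 percent success rate means about 1.33 attempts per login on average and roughly a 6 percent chance of failing twice in a row; at 96 percent the figures are about 1.04 attempts and 0.16 percent.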
Oommen Thomas concedes this but says that his team is looking at ways to improve the success rate. “There is a region where humans and machines both do badly, but there is also a sweet spot where humans do well and machines do badly,” he says, and this is what he and his team are now trying to find. “There’s a lot of money to be made circumventing CAPTCHAs to generate spam,” he adds, meaning that spambots are likely to get better and better at breaking existing CAPTCHAs.
“It’s a worthy thing to look at,” says von Ahn, but he is not sure that there’s a need for a completely new kind of CAPTCHA. Systems like reCAPTCHA (currently one of the most widely used systems: it’s running on more than 100,000 websites) are regularly improved to stay ahead of the curve. One trick is to scan in characters from old books, with all their imperfections. “We only use the ones that computers cannot recognize,” von Ahn says. Because of this, reCAPTCHA is extremely good at keeping the bots out, he says, with the best known attacks achieving a success rate of no better than one in 1,000.
“Humans are just not that good at recognizing handwriting,” von Ahn adds, noting that, as we use handwriting less and less in modern life, our ability to recognize squiggly text may fade further still.