Technology Review - Published By MIT

May 2004

The Paper Killer

Finally, character recognition software that can reliably scan paper documents and let you get rid of them.

By Simson Garfinkel

I never used to believe the claims of companies that sell optical character recognition (OCR) software, the programs that turn scans of printed pages into editable text. That's because I know how to multiply. When the companies claimed "99 percent accuracy," I translated that to roughly one error on every line. And that, in my opinion, was unacceptable.
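
The multiplication can be spelled out in a few lines of Python. This is only a sketch: it reads "99 percent accuracy" as a per-character figure, assumes a printed line of about 80 characters (the column gives no number), and treats errors as independent. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the "one error on every line" estimate.
# Assumptions (not from the article): accuracy is per character,
# a line holds about 80 characters, and errors are independent.
CHARS_PER_LINE = 80
CHAR_ACCURACY = 0.99


def expected_errors_per_line(accuracy, chars_per_line):
    """Average number of misread characters on one line."""
    return (1.0 - accuracy) * chars_per_line


def prob_line_has_error(accuracy, chars_per_line):
    """Chance that a line contains at least one misread character."""
    return 1.0 - accuracy ** chars_per_line


if __name__ == "__main__":
    errors = expected_errors_per_line(CHAR_ACCURACY, CHARS_PER_LINE)
    chance = prob_line_has_error(CHAR_ACCURACY, CHARS_PER_LINE)
    print(f"expected errors per line: {errors:.2f}")
    print(f"chance a line has an error: {chance:.0%}")
```

Under those assumptions the average works out to about 0.8 errors per line, and a bit more than half of all lines contain at least one mistake, which is close enough to "one error on every line" for a skeptic.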

Then last fall, I had a revelatory experience. I was trying to sell an old book of dolls on eBay, so I scanned a page and told Microsoft Office Document Scanning to save it as an image. When the program prompted me for a file name, the name it suggested corresponded to the scanned page's headline.

Funny, I didn't remember typing in that headline.

On a whim, I saved the page as a Microsoft Word document. Then I opened the file and compared it to the original. There was not a single error in five paragraphs of text.

Optical character recognition software has done a lot of growing up in the past ten years. A lot of people don't realize just how good the technology has become because they base their impressions on the free software that comes with scanners rather than on professional-grade software, which costs hundreds of dollars. But after I wrote last year about how I was scanning all of my old documents into PDF files (see "Slaying the Paper Dragon," TR October 2003), a publicist at Abbyy USA sent me the company's FineReader 7.0 corporate edition.

My test was simple: I fed the software my bank statement. FineReader spent 15 seconds scanning the page and another 30 turning the image into text.

Modern character recognition systems use a variety of mathematical techniques to perform their magic. Imaging algorithms remove speckles and rotate the page so that it's "straight." Then a series of algorithms separates out each glyph, determines the likelihood that any glyph is a particular letter, and consults a dictionary to come up with a probable word. The software can also decide to accept a word that's not in the dictionary, if the image looks good and there are no obvious close matches.
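
The last two steps, weighing glyph likelihoods against a dictionary, can be illustrated with a toy Python sketch. It is not how FineReader or any other product works internally. The names (`score`, `read_word`), the thresholds, and the likelihood numbers are all invented for the example, and it assumes the earlier steps have already produced, for each glyph, a table of candidate characters with probabilities.

```python
from math import prod


def score(word, glyphs, floor=1e-6):
    """Product of per-glyph likelihoods for reading `glyphs` as `word`.

    Characters a glyph table does not list get a tiny `floor` value.
    """
    if len(word) != len(glyphs):
        return 0.0
    return prod(g.get(ch, floor) for ch, g in zip(word, glyphs))


def read_word(glyphs, dictionary, accept_raw=0.9, close_ratio=0.1):
    """Pick a word for a run of glyphs (hypothetical decision rule)."""
    # Raw reading: the single most likely character for each glyph.
    raw = "".join(max(g, key=g.get) for g in glyphs)
    if raw in dictionary:
        return raw

    # Otherwise find the dictionary word that best fits the glyphs.
    best = max(dictionary, key=lambda w: score(w, glyphs), default=None)
    best_score = score(best, glyphs) if best else 0.0

    # Keep a non-dictionary word only if every glyph was read with
    # confidence and no dictionary word comes close to the raw reading.
    image_looks_good = min(max(g.values()) for g in glyphs) >= accept_raw
    no_close_match = best_score < close_ratio * score(raw, glyphs)
    if best is None or best_score == 0.0 or (image_looks_good and no_close_match):
        return raw
    return best


if __name__ == "__main__":
    words = {"clear", "cedar", "abbey"}
    # The second glyph could be the digit 1 or the letter l.
    smudged = [{"c": 0.95, "e": 0.05}, {"1": 0.55, "l": 0.45},
               {"e": 0.90, "c": 0.10}, {"a": 0.90, "o": 0.10},
               {"r": 0.95, "n": 0.05}]
    print(read_word(smudged, words))
    # Cleanly printed glyphs that spell a name the dictionary lacks.
    crisp = [{"a": 0.97, "o": 0.03}, {"b": 0.97, "h": 0.03},
             {"b": 0.97, "h": 0.03}, {"y": 0.97, "v": 0.03},
             {"y": 0.97, "v": 0.03}]
    print(read_word(crisp, words))
```

In the first case the raw reading "c1ear" is not a word and one glyph is uncertain, so the dictionary wins and the result is "clear". In the second, every glyph is crisp and nothing in the dictionary fits well, so the unfamiliar word "abbyy" is accepted as printed.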

This article is from the May 2004 issue of Technology Review.
