I never used to believe the claims of companies that sell optical character recognition (OCR) software, the programs that turn scans of printed pages into editable text. That’s because I know how to multiply. When the companies claimed “99 percent accuracy,” I translated that to roughly one error on every line. And that, in my opinion, was unacceptable.
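The multiplication is easy to spell out. The sketch below assumes a printed line of about 80 characters and errors that strike each character independently; both are round-number assumptions for illustration, not figures from any OCR vendor, and the function names are made up for this example.

```python
# Back-of-the-envelope check of what "99 percent accuracy" means per line.
# Assumption: accuracy is measured per character and errors are independent.

CHARS_PER_LINE = 80  # assumed line length; real pages vary


def expected_errors(chars, accuracy):
    """Average number of wrong characters in a run of `chars` characters."""
    return chars * (1.0 - accuracy)


def prob_at_least_one_error(chars, accuracy):
    """Chance that a run of `chars` characters contains one or more errors."""
    return 1.0 - accuracy ** chars


if __name__ == "__main__":
    print(expected_errors(CHARS_PER_LINE, 0.99))
    print(prob_at_least_one_error(CHARS_PER_LINE, 0.99))
```

Under these assumptions the average works out to a little under one error per line, and more than half of all lines contain at least one mistake.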
Then last fall, I had a revelatory experience. I was trying to sell an old book of dolls on eBay, so I scanned a page and told Microsoft Office Document Scanning to save it as an image. When the program prompted me for a file name, the name it suggested corresponded to the scanned page’s headline.
Funny, I didn’t remember typing in that headline.
On a whim, I saved the page as a Microsoft Word document. Then I opened the file and compared it to the original. There was not a single error in five paragraphs of text.
Optical character recognition software has done a lot of growing up in the past ten years. A lot of people don’t realize just how good the technology has become because they base their impressions on the free software that comes with scanners rather than on professional-grade software, which costs hundreds of dollars. But after I wrote last year about how I was scanning all of my old documents into PDF files (see “Slaying the Paper Dragon,” TR October 2003), a publicist at Abbyy USA sent me the company’s FineReader 7.0 corporate edition.
My test was simple: I fed the software my bank statement. FineReader spent 15 seconds scanning the page and another 30 turning the image into text.
Modern character recognition systems use a variety of mathematical techniques to perform their magic. Imaging algorithms remove speckles and rotate the page so that it’s “straight.” Then a series of algorithms separates out each glyph, determines the likelihood that any glyph is a particular letter, and consults a dictionary to come up with a probable word. The software can also decide to accept a word that’s not in the dictionary, if the image looks good and there are no obvious close matches.
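The last two steps, scoring each glyph and then consulting a dictionary, can be illustrated with a toy sketch. This is not how FineReader or any commercial product is implemented. The data layout (one dictionary of letter probabilities per glyph), the function names, and the 0.9 confidence threshold are all assumptions made up for this example.

```python
import math

NEG_INF = float("-inf")


def word_score(word, glyph_probs):
    """Log-likelihood that the sequence of glyphs spells `word`."""
    if len(word) != len(glyph_probs):
        return NEG_INF
    total = 0.0
    for letter, probs in zip(word, glyph_probs):
        p = probs.get(letter, 0.0)
        if p == 0.0:
            return NEG_INF  # the image rules this letter out
        total += math.log(p)
    return total


def recognize(glyph_probs, dictionary, min_conf=0.9):
    """Return (word, confident) for one word's worth of glyphs.

    glyph_probs: list of {letter: probability} dicts, one per glyph.
    dictionary:  set of known words.
    min_conf:    hypothetical threshold for "the image looks good".
    """
    # Reading taken from the image alone: the likeliest letter per glyph.
    raw = "".join(max(probs, key=probs.get) for probs in glyph_probs)
    if raw in dictionary:
        return raw, True

    # Otherwise look for dictionary words the glyphs could plausibly spell.
    scored = [(word_score(w, glyph_probs), w) for w in dictionary]
    candidates = [(s, w) for s, w in scored if s > NEG_INF]
    if candidates:
        return max(candidates)[1], True

    # No close match: accept the raw reading, trusting it only if every
    # glyph was recognized with high confidence.
    image_looks_good = all(max(p.values()) >= min_conf for p in glyph_probs)
    return raw, image_looks_good
```

Given glyphs whose likeliest reading is "hank" but whose first glyph might also be a "b", a dictionary containing "bank" pulls the result to "bank". A crisply printed word that matches nothing in the dictionary, such as a company name, comes back unchanged and marked as confident.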