So when I get a credit-card or bank statement by mail, I usually go to the organization’s website and download a PDF. (I wish these organizations could send the PDFs out by e-mail, but that’s another issue.) But many small organizations provide paper statements only. These, like all of my personal papers, I scan with Fujitsu’s relatively new ScanSnap FI-5110EOX2. I just load a stack of paper into its hopper and press a button. The ScanSnap scans both sides of your paper at the same time and creates a single PDF file. It knows whether you are scanning a black-and-white or color page and can be programmed to automatically remove blank pages from the final PDF.
But scanned PDFs are not hassle-free. Not only can different PDFs contain different kinds of information, but they can represent it in different ways. Unlike the typical PDF that you might download from a website, the PDF that a scanner produces is an image, not text, so you can’t index and search it the way you can, say, a Word document. If you want that added functionality, you need to turn the images back into text. This is done through a technology called optical character recognition (OCR).
Many people think of OCR as clunky technology that frequently makes mistakes. Although that’s still true of some OCR engines – most notably, the free engine that ships with some versions of Adobe Acrobat – today’s professional OCR engines, like ABBYY FineReader 8.0, can accurately recognize text in a variety of languages, tables of numbers, and even names. With FineReader 8.0 or comparable software, you should get good results.
Instead of replacing the original image with the recognized text, which could result in data loss if the recognition software makes any mistakes, modern systems store both versions of a document. This means that you can consult the picture of the paper original but use the text for searching and, if you need to, pasting into other documents.
Today’s desktop search engines, like Google Desktop and Apple’s Spotlight, can read the text of the PDF files and automatically index them for you. And because PDF is an open format with many interoperable implementations, there’s little chance that these files will be unreadable two or three decades from now.
Personally, I don’t like relying on search to find my documents. Instead, I’ve adopted a file-and-folder system that’s remarkably similar to the one I used to use for paper documents in my file cabinets. When I scan a set of paper documents, I give them a descriptive name, like “2005_bank_statements.pdf.” I then store this file in a folder named “finance,” which I put inside another folder named “2005.” This makes it easy to find a document without searching for it. It also makes it easy to back up my important documents to CD-ROM or to another hard drive.
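For readers who would rather script this filing step than do it by hand, here is a minimal Python sketch of the same year/category layout. The function names (`archive_path`, `file_scan`) and the archive root are my own illustrative choices, not part of any scanner software, and the sketch assumes each scan is already a single PDF on disk.

```python
import shutil
from pathlib import Path


def archive_path(root, year, category, description):
    """Build <root>/<year>/<category>/<year>_<description>.pdf."""
    # Lowercase the description and join its words with underscores,
    # e.g. "Bank Statements" -> "bank_statements".
    slug = "_".join(description.lower().split())
    return Path(root) / str(year) / category / f"{year}_{slug}.pdf"


def file_scan(scan, root, year, category, description):
    """Move a scanned PDF into the archive, creating folders as needed."""
    dest = archive_path(root, year, category, description)
    if dest.exists():
        # Never silently replace a document that is already filed.
        raise FileExistsError(f"refusing to overwrite {dest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(scan), str(dest))
    return dest
```

Calling `file_scan("scan.pdf", "archive", 2005, "finance", "bank statements")` would move the file to `archive/2005/finance/2005_bank_statements.pdf`. Because everything lives under one root folder, backing up the archive is a matter of copying that folder.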
So is there trouble in this electronic paradise? Yes. For starters, the ScanSnap doesn’t use the industry-standard interface for digital scanners. For reasons known only to Fujitsu, the scanner can be used only with its proprietary scanning software.