MIT Technology Review



So when I get a credit-card or bank statement by mail, I usually go to the organization’s website and download a PDF. (I wish these organizations could send the PDFs out by e-mail, but that’s another issue.) But many small organizations provide paper statements only. These, like all of my personal papers, I scan with Fujitsu’s relatively new ScanSnap FI-5110EOX2. I just load a stack of paper into its hopper and press a button. The ScanSnap scans both sides of each sheet at the same time and creates a single PDF file. It detects whether a page is black-and-white or color, and it can be set to automatically remove blank pages from the final PDF.

But scanned PDFs are not hassle-free. Not only can different PDFs contain different kinds of information, but they can represent it in different ways. Unlike the typical PDF that you might download from a website, the PDF that a scanner produces is an image, not text, so you can’t index and search it the way you can, say, a Word document. If you want that added functionality, you need to turn the images back into text. This is done through a technology called optical character recognition (OCR).

Many people think of OCR as clunky technology that frequently makes mistakes. Although that’s still true of some OCR engines – most notably, the free engine that ships with some versions of Adobe Acrobat – today’s professional OCR engines, like Abbyy Finereader 8.0, can accurately recognize text in a variety of languages, tables of numbers, and even names. As long as you are using Abbyy Finereader 8.0 or comparable software, you’ll get good results.

Instead of replacing the original image with the recognized text, which could result in data loss if the recognition software makes any mistakes, modern systems store both versions of a document. This means that you can consult the picture of the paper original but use the text for searching and, if you need to, pasting into other documents.

Today’s desktop search engines, like Google Desktop and Apple’s Spotlight, can read the text of the PDF files and automatically index them for you. And because PDF is an open format with many interoperable implementations, there’s little chance that you won’t be able to read these files in two or three decades.

Personally, I don’t like relying on search to find my documents. Instead, I’ve adopted a file-and-folder system that’s remarkably similar to the one I once used for paper documents in my file cabinets. When I scan a set of paper documents, I give the resulting file a descriptive name, like “2005_bank_statements.pdf”. I then store this file in a folder named “finance,” which I put inside another folder named “2005.” This makes it easy to find a document without searching for it. It also makes it easy to back up my important documents to CD-ROM or to another hard drive.
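
For readers who would rather script this filing habit than do it by hand, here is a minimal Python sketch of the year/category layout described above. The function names (archive_path, file_scan), the archive root, and the rule of prefixing the file name with the year are illustrative assumptions based on the single example given here, not part of any scanner software.

```python
from pathlib import Path
import shutil


def archive_path(root: Path, year: int, category: str, description: str) -> Path:
    """Build root/<year>/<category>/<year>_<description>.pdf.

    The naming rule is an assumption modeled on the example
    2005/finance/2005_bank_statements.pdf.
    """
    return root / str(year) / category / f"{year}_{description}.pdf"


def file_scan(scan: Path, root: Path, year: int, category: str, description: str) -> Path:
    """Move a freshly scanned PDF into its place in the archive."""
    dest = archive_path(root, year, category, description)
    # Create the year and category folders if they do not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Refuse to overwrite a document that has already been filed.
    if dest.exists():
        raise FileExistsError(f"already filed: {dest}")
    shutil.move(str(scan), str(dest))
    return dest
```

Calling file_scan(Path("scan0001.pdf"), Path("archive"), 2005, "finance", "bank_statements") would move the scan to archive/2005/finance/2005_bank_statements.pdf. Because each year lives in its own folder, backing up a year means copying a single directory.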

So is there trouble in this electronic paradise? Yes. For starters, the ScanSnap doesn’t use the industry-standard interface for digital scanners. For reasons known only to Fujitsu, the scanner can be used only with its proprietary scanning software.
