Confessions of a Scan Artist

You, too, can commit your life to digital – and throw away your paper records.

Simson Garfinkel ’87, PhD ’05

March 1, 2006

As our lives become more digitized, a number of eminent computer scientists are starting to warn that our most treasured family photos, heartfelt correspondence, and legal documents might be irretrievably lost if we do not print them on acid-free paper and safely store them in a cool, dark, and dry place. After all, the original Declaration of Independence, written on parchment, is still on display in Washington, DC, but digital documents from even 1990-vintage personal computers can be difficult to read, because few people have five-and-a-quarter-inch floppy-disk drives anymore.

I think those computer scientists have got it wrong. The problem with paper documents is that they are forever vulnerable to destruction – from fire or flood, for example – because they exist in one place. I prefer electronic documents, which can be easily copied and “backed up” to different locations – different hard drives, different buildings, and even different states. And though it does require dedication to manage your life this way, today’s technology makes it easier than ever.

I bought my mother an Apple eMac with a high-speed Internet connection. Every day my family’s digital photo album is copied to her computer. Mom gets to see up-to-the-minute photos of her grandchildren, thanks to Apple’s marvelous screen saver, and I get reliable off-site backup. Other people I know simply send CD-ROMs to their parents every few months. Either way, the ease of making useful off-site backups demonstrates one of digital documents’ real advantages over paper.
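For readers who want to automate this kind of copying, here is a minimal sketch in Python. It is not the setup described above: the function names `sha256_of` and `back_up` are hypothetical, and it assumes each backup location (a second hard drive, a mounted folder on another computer) appears as an ordinary directory. It copies every file and then compares checksums, so a damaged copy is noticed right away rather than years later.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy every file under `source` into each destination folder.

    Hypothetical helper: returns the list of copies it made and raises
    an error if any copy does not match its original.
    """
    copies = []
    for destination in destinations:
        # Collect the file list first so the walk is not affected by copying.
        originals = sorted(p for p in source.rglob("*") if p.is_file())
        for original in originals:
            target = destination / original.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(original, target)  # copy2 also keeps timestamps
            if sha256_of(original) != sha256_of(target):
                raise IOError(f"copy of {original} to {target} does not match")
            copies.append(target)
    return copies
```

Calling `back_up(Path("photos"), [Path("/Volumes/backup_drive/photos")])` would mirror the album once; the paths here are placeholders, and running it every day is left to whatever scheduler your computer provides.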

Some of the paper documents that show up at my house, like credit card bills, annual tax statements, and even snapshots from my mother’s disposable camera, aren’t as easily rendered into digital form, of course. It’s all too tempting to throw them into a file cabinet or photo box. Moving them into the digital domain takes work; taking the extra step, and throwing away the paper original, used to require an act of faith. But digital documents are worth the effort, and we should all be creating them. These days, it’s relatively easy to understand which formats will survive and be readable in 20 years’ time and which are likely to go the way of the eight-track tape.

The key to survival, it turns out, is openness. File formats that are published and can be implemented without payment of a licensing fee – formats, that is, that embody the principles of open-source software – survive, because knowledge about how to read them can be freely incorporated into many applications. Other file formats die when the companies behind them stumble.

Two modern file formats likely to enjoy long-term durability are the Adobe Acrobat portable document format (PDF) and the JPEG image format. That’s because both of these formats are public, and there is a wide collection of software compatible with them. Yes, the source code for Acrobat itself is proprietary, but PDF files can be directly opened on the Macintosh platform without the use of any Adobe code. They can also be viewed on Linux machines with an open-source program called Ghostscript. JPEG, meanwhile, is used by millions of digital cameras and practically every computer that’s sold today. I cannot imagine a future computer system that could not read the JPEG file format. Your digital photos are safe – provided that you have good backups.

So when I get a credit-card or bank statement by mail, I usually go to the organization’s website and download a PDF. (I wish these organizations could send the PDFs out by e-mail, but that’s another issue.) But many small organizations provide paper statements only. These, like all of my personal papers, I scan with Fujitsu’s relatively new ScanSnap FI-5110EOX2. I just load a stack of paper into its hopper and press a button. The ScanSnap scans both sides of your paper at the same time and creates a single PDF file. It knows whether you are scanning a black-and-white or color page and can be programmed to automatically remove blank pages from the final PDF.

But scanned PDFs are not hassle free. Not only can different PDFs contain different kinds of information, but they can represent it in different ways. Unlike the typical PDF that you might download from a website, the PDF that a scanner produces is an image, not text, so you can’t index and search it the way you can, say, a Word document. If you want that added functionality, you need to turn the images back into text. This is done through a technology called optical character recognition (OCR).

Many people think of OCR as clunky technology that frequently makes mistakes. Although that’s still true of some OCR engines – most notably, the free engine that ships with some versions of Adobe Acrobat – today’s professional OCR engines, like Abbyy FineReader 8.0, can accurately recognize text in a variety of languages, tables of numbers, and even names. As long as you are using Abbyy FineReader 8.0 or comparable software, you’ll get good results.

Instead of replacing the original image with the recognized text, which could result in data loss if the recognition software makes any mistakes, modern systems store both versions of a document. This means that you can consult the picture of the paper original but use the text for searching and, if you need to, pasting into other documents.

Today’s desktop search engines, like Google Desktop and Apple’s Spotlight, can read the text of the PDF files and automatically index them for you. And because PDF is also an open format with many interoperable implementations, there’s little chance that you won’t be able to read these files in two or three decades.

Personally, I don’t like relying on search to find my documents. Instead, I’ve adopted a file-and-folder system that’s remarkably similar to the one I used to use for paper documents in my file cabinets. When I scan a set of paper documents, I give them a descriptive name, like “2005_bank_statements.pdf.” I then store this file in a folder named “finance,” which I put inside another folder named “2005.” This makes it easy to find a document without searching for it. It also makes it easy to back up my important documents to CD-ROM or to another hard drive.
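The naming scheme above is simple enough to express in a few lines of Python. This is only a sketch: the function name `filing_path` and the `archive` root folder are assumptions made for illustration, not part of any scanner’s software. It builds the year folder, the category folder, and the descriptive file name in one step.

```python
from pathlib import Path


def filing_path(root: Path, year: int, category: str, description: str) -> Path:
    """Return root/<year>/<category>/<year>_<description>.pdf (hypothetical helper)."""
    # Lowercase the description and join its words with underscores,
    # matching a name like "2005_bank_statements.pdf".
    slug = "_".join(description.lower().split())
    return root / str(year) / category / f"{year}_{slug}.pdf"


if __name__ == "__main__":
    path = filing_path(Path("archive"), 2005, "finance", "Bank Statements")
    print(path.as_posix())  # archive/2005/finance/2005_bank_statements.pdf
```

Because the year comes first in the path, a whole year of documents sits under one folder and can be copied to a CD-ROM or another drive in a single operation.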

So is there trouble in this electronic paradise? Yes. For starters, the ScanSnap doesn’t use the industry-standard interface for digital scanners. For reasons known only to Fujitsu, the scanner can be used only with its proprietary scanning software.

And I’ve been burned by electronic documents before. Back in the 1990s, I scanned a lot of articles with a low-quality 200-dots-per-inch scanner and stored them in Visioneer’s proprietary “Max” format. I’m glad I didn’t throw away the originals; recently, I rescanned them all.

But things are different now. Scanners create high-quality images in file formats that are open and widely implemented. For the past two years, I’ve been scanning my papers and throwing away the originals – and I feel good about doing that. On many occasions I’ve had to go back and look things up in my digital files. Documents were easier to find, and once I found them, I could send them off by e-mail.

One of the best reasons for committing to digital storage addresses one of the biggest fears people have about it: the question of whether you’ll regret, in 20 years, having taken the plunge. If we look at the trend in all of the things that we get and store – correspondence, music, photography – what we see is that more and more of what is coming at us is digital on arrival. Do you really expect to get your home heating bill by regular mail in 10 years? Maybe. But by committing to a uniform storage system for all of our personal documents, even if it means, for the moment, having to convert a few hard copies every month to digital files, we are simply giving ourselves a head start on building a single, comprehensive personal library, one whose chief benefit is that it can never burn down.

Fujitsu ScanSnap FI-5110EOX2 Color Duplex Scanner
$495.00

Abbyy FineReader 8.0 Professional
$399.00

Abbyy PDF Transformer
$49.99

Simson Garfinkel is a postgraduate fellow at Harvard University’s Center for Research on Computation and Society.
