How Sheer Tonnage Rendered Information Unfree

The apparent cheapness of information was only a temporary effect of the dawn of the Internet age.

Christopher Mims

June 14, 2012

Kent Anderson, an academic publisher and former executive at the New England Journal of Medicine, has written a comprehensive defense of the idea that digital goods are not inherently free. As he puts it:

There is a persistent conceit stemming from the IT arrogance we continue to see around us, but it’s one that most IT professionals are finding real problems with — the notion that storing and distributing digital goods is a trivial, simple matter, adds nothing to their cost, and can be effectively done by amateurs.

While Anderson runs through a number of cogent points – everything from the need to secure digital goods to the demands of cataloging them – it struck me that the real issue with digital goods is that they have become so easy to create that their sheer volume makes their management costly.

A prime example is the set of challenges faced by the Library of Congress, which seeks to archive every tweet ever posted. As Audrey Watters outlined at O’Reilly Radar:

What makes the endeavor challenging, if not the size of the archive, is its composition: billions and billions and billions of tweets. […]

Each tweet is a JSON file, containing an immense amount of metadata in addition to the contents of the tweet itself: date and time, number of followers, account creation date, geodata, and so on. To add another layer of complexity, many tweets contain shortened URLs, and the Library of Congress is in discussions with many of these providers as well as with the Internet Archive and its 301works project to help resolve and map the links.
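
To make that overhead concrete, here is a minimal Python sketch that measures how much of a serialized tweet record is metadata rather than the text of the tweet. The record and its field names are hypothetical, chosen only to mirror the kinds of metadata listed above; they are not the actual schema used by Twitter or the Library of Congress.

```python
import json

# Hypothetical tweet record: field names and values are illustrative only,
# not the real schema used by Twitter or the Library of Congress.
sample_tweet = {
    "text": "Archiving every tweet is harder than it sounds.",
    "created_at": "2012-06-14T12:00:00Z",
    "user": {
        "screen_name": "example_user",
        "followers_count": 1234,
        "account_created_at": "2009-03-01T08:30:00Z",
    },
    "geo": {"lat": 38.8887, "lon": -77.0047},
    "urls": ["http://example.com/abc123"],
}


def metadata_share(tweet: dict) -> float:
    """Return the fraction of the serialized record that is not the tweet text."""
    total_bytes = len(json.dumps(tweet).encode("utf-8"))
    text_bytes = len(tweet["text"].encode("utf-8"))
    return 1 - text_bytes / total_bytes


share = metadata_share(sample_tweet)
print(f"Metadata share of serialized record: {share:.0%}")
```

Even in a toy record like this one, most of the bytes describe the tweet rather than carry its message, and each of those fields has to be stored, indexed, and kept consistent across billions of records.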

What we’re seeing with digital goods is a classic economic effect known as rebound. When technology increases the efficiency of resource use, we tend to simply consume more of a given resource. (This is one of the effects that bedevils our attempts to lower our energy consumption through efficiency.)

Anderson points out that “some data sets [like genomic data] are propagating at a rate that exceeds Moore’s Law.” (Consider also that we’re approaching the end of Moore’s Law.)

With the advent of the Internet of Things, data is becoming “an effect of just living,” which will only accelerate our need to store and manage it.

It seems that the apparent cheapness of information was only a temporary effect of the beginning of the Internet age. As we transferred analogue media produced through labor-intensive processes to the web, we discovered that it was many times larger than we needed it to be, and also many times cheaper to get data to end users. Now that the machines themselves are throwing off so much data, and all of us are producing many times what we once did, information once again has a cost.

h/t Hanna Waters
