Skip to Content

Internet Archaeologists Reconstruct Lost Web Pages

Computer scientists have developed a technique for reconstructing missing web resources from the context in which they appeared, just like archaeologists in the physical world

The internet is disappearing. And with it goes an important part of our recorded history. That was the conclusion of a study this blog looked at last year, which measured the rate at which links shared over social media platforms such as Twitter, were disappearing.

The conclusion was that this data is being lost at the rate of 11 per cent within a year and 27 per cent within two years.

Today, the researchers behind this work reveal that all is not lost.  Hany SalahEldeen and Michael Nelson at Old Dominion University in Norfolk, Virginia, have found a way to reconstruct deleted material and say it works reasonably well.

First, some background. These guys began their work by studying the thousands of tweets, blog posts and other resources that were published during the 18 days of uprising in the Egyptian revolution in 2011. These resources were important, they say, because they provide a valuable record of an historic event.

However, they also discovered that some of these posts and others on the web were disappearing and began to measure the rate at which they were vanishing. Hence the numbers given above.

The new work is their attempt to reconstruct these missing posts and resources, at least in part, from the clues they leave behind on the web.

SalahEldeen and Nelson began by attempting to confirm the earlier results and that threw up a surprise. “An interesting phenomena occurred as several of the resources that were previously declared as missing became available again,” they say.

That’s possible if the original disappearance was the result of a disrupted domain or archive that was later restored, or a user account that had been suspended and later reinstated.

So SalahEldeen and Nelson wondered how it might be possible to find this resurrected material, even when it is no longer in its original cyber neighbourhood. They point out that most shared resources leave traces elsewhere on the web, such as retweets, hashtags, comments and so on.

The idea that SalahEldeen and Nelson came up with was to attempt to reconstruct a missing resource by searching for the traces left on the web. For that, they used the Twitter search engine Topsy, which allows them to enter the address of a missing resource and returns the tweets that refer to it. This is the resource’s “tweet signature”.

They then extract the top five most frequent terms in this signature and use them as a search query in Google.  The result is a list of potential replacements for the lost resource.

An important question, of course, is how closely the replacement candidates match the original resource. To test this, SalahEldeen and Nelson carried out the same process for resources that had not disappeared and then compared the replacement candidates with the originals. They say the replacements had a 70% textual similarity to the original resource about 40% of the time.

Not perfect, of course, but better than nothing. And perhaps given time it will become possible to do better.

What’s interesting is that this process is a kind of internet archaeology that reconstructs an historical web page from the context in which it occurred. That’s a fascinating new discipline.

In the real world, archaeologists and anthropologists have become highly skilled at reconstructing natural history in this way. The conclusions that can be drawn from the discovery and analysis of a single tooth, for example, are truly astounding.

There’s no reason why internet archaeologists cannot become just as skilled.

Ref: arxiv.org/abs/1309.2648: Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web

Keep Reading

Most Popular

Europe's AI Act concept
Europe's AI Act concept

A quick guide to the most important AI law you’ve never heard of

The European Union is planning new legislation aimed at curbing the worst harms associated with artificial intelligence.

Uber Autonomous Vehicles parked in a lot
Uber Autonomous Vehicles parked in a lot

It will soon be easy for self-driving cars to hide in plain sight. We shouldn’t let them.

If they ever hit our roads for real, other drivers need to know exactly what they are.

supermassive black hole at center of Milky Way
supermassive black hole at center of Milky Way

This is the first image of the black hole at the center of our galaxy

The stunning image was made possible by linking eight existing radio observatories across the globe.

transplant surgery
transplant surgery

The gene-edited pig heart given to a dying patient was infected with a pig virus

The first transplant of a genetically-modified pig heart into a human may have ended prematurely because of a well-known—and avoidable—risk.

Stay connected

Illustration by Rose WongIllustration by Rose Wong

Get the latest updates from
MIT Technology Review

Discover special offers, top stories, upcoming events, and more.

Thank you for submitting your email!

Explore more newsletters

It looks like something went wrong.

We’re having trouble saving your preferences. Try refreshing this page and updating them one more time. If you continue to get this message, reach out to us at customer-service@technologyreview.com with a list of newsletters you’d like to receive.