Internet Archaeologists Reconstruct Lost Web Pages

Computer scientists have developed a technique for reconstructing missing web resources from the context in which they appeared, just like archaeologists in the physical world

Emerging Technology from the arXivarchive page

September 18, 2013

The internet is disappearing. And with it goes an important part of our recorded history. That was the conclusion of a study this blog looked at last year, which measured the rate at which links shared over social media platforms such as Twitter, were disappearing.

The conclusion was that this data is being lost at the rate of 11 per cent within a year and 27 per cent within two years.

Today, the researchers behind this work reveal that all is not lost. Hany SalahEldeen and Michael Nelson at Old Dominion University in Norfolk, Virginia, have found a way to reconstruct deleted material and say it works reasonably well.

First, some background. These guys began their work by studying the thousands of tweets, blog posts and other resources that were published during the 18 days of uprising in the Egyptian revolution in 2011. These resources were important, they say, because they provide a valuable record of an historic event.

However, they also discovered that some of these posts and others on the web were disappearing and began to measure the rate at which they were vanishing. Hence the numbers given above.

The new work is their attempt to reconstruct these missing posts and resources, at least in part, from the clues they leave behind on the web.

SalahEldeen and Nelson began by attempting to confirm the earlier results and that threw up a surprise. “An interesting phenomena occurred as several of the resources that were previously declared as missing became available again,” they say.

That’s possible if the original disappearance was the result of a disrupted domain or archive that was later restored, or a user account that had been suspended and later reinstated.

So SalahEldeen and Nelson wondered how it might be possible to find this resurrected material, even when it is no longer in its original cyber neighbourhood. They point out that most shared resources leave traces elsewhere on the web, such as retweets, hashtags, comments and so on.

The idea that SalahEldeen and Nelson came up with was to attempt to reconstruct a missing resource by searching for the traces left on the web. For that, they used the Twitter search engine Topsy, which allows them to enter the address of a missing resource and returns the tweets that refer to it. This is the resource’s “tweet signature”.

They then extract the top five most frequent terms in this signature and use them as a search query in Google. The result is a list of potential replacements for the lost resource.

An important question, of course, is how closely the replacement candidates match the original resource. To test this, SalahEldeen and Nelson carried out the same process for resources that had not disappeared and then compared the replacement candidates with the originals. They say the replacements had a 70% textual similarity to the original resource about 40% of the time.

Not perfect, of course, but better than nothing. And perhaps given time it will become possible to do better.

What’s interesting is that this process is a kind of internet archaeology that reconstructs an historical web page from the context in which it occurred. That’s a fascinating new discipline.

In the real world, archaeologists and anthropologists have become highly skilled at reconstructing natural history in this way. The conclusions that can be drawn from the discovery and analysis of a single tooth, for example, are truly astounding.

There’s no reason why internet archaeologists cannot become just as skilled.

Ref: arxiv.org/abs/1309.2648: Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.