Emerging Technology from the arXiv

A View from Emerging Technology from the arXiv

Internet Archaeologists Reconstruct Lost Web Pages

Computer scientists have developed a technique for reconstructing missing web resources from the context in which they appeared, just like archaeologists in the physical world

  • September 18, 2013

The internet is disappearing. And with it goes an important part of our recorded history. That was the conclusion of a study this blog looked at last year, which measured the rate at which links shared over social media platforms such as Twitter, were disappearing.

The conclusion was that this data is being lost at the rate of 11 per cent within a year and 27 per cent within two years.

Today, the researchers behind this work reveal that all is not lost.  Hany SalahEldeen and Michael Nelson at Old Dominion University in Norfolk, Virginia, have found a way to reconstruct deleted material and say it works reasonably well.

First, some background. These guys began their work by studying the thousands of tweets, blog posts and other resources that were published during the 18 days of uprising in the Egyptian revolution in 2011. These resources were important, they say, because they provide a valuable record of an historic event.

However, they also discovered that some of these posts and others on the web were disappearing and began to measure the rate at which they were vanishing. Hence the numbers given above.

The new work is their attempt to reconstruct these missing posts and resources, at least in part, from the clues they leave behind on the web.

SalahEldeen and Nelson began by attempting to confirm the earlier results and that threw up a surprise. “An interesting phenomena occurred as several of the resources that were previously declared as missing became available again,” they say.

That’s possible if the original disappearance was the result of a disrupted domain or archive that was later restored, or a user account that had been suspended and later reinstated.

So SalahEldeen and Nelson wondered how it might be possible to find this resurrected material, even when it is no longer in its original cyber neighbourhood. They point out that most shared resources leave traces elsewhere on the web, such as retweets, hashtags, comments and so on.

The idea that SalahEldeen and Nelson came up with was to attempt to reconstruct a missing resource by searching for the traces left on the web. For that, they used the Twitter search engine Topsy, which allows them to enter the address of a missing resource and returns the tweets that refer to it. This is the resource’s “tweet signature”.

They then extract the top five most frequent terms in this signature and use them as a search query in Google.  The result is a list of potential replacements for the lost resource.

An important question, of course, is how closely the replacement candidates match the original resource. To test this, SalahEldeen and Nelson carried out the same process for resources that had not disappeared and then compared the replacement candidates with the originals. They say the replacements had a 70% textual similarity to the original resource about 40% of the time.

Not perfect, of course, but better than nothing. And perhaps given time it will become possible to do better.

What’s interesting is that this process is a kind of internet archaeology that reconstructs an historical web page from the context in which it occurred. That’s a fascinating new discipline.

In the real world, archaeologists and anthropologists have become highly skilled at reconstructing natural history in this way. The conclusions that can be drawn from the discovery and analysis of a single tooth, for example, are truly astounding.

There’s no reason why internet archaeologists cannot become just as skilled.

Ref: arxiv.org/abs/1309.2648: Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web

Get stories like this before anyone else with First Look.

Subscribe today
Already a Premium subscriber? Log in.

Uh oh–you've read all of your free articles for this month.

Insider Premium
$179.95/yr US PRICE

More from Connectivity

What it means to be constantly connected with each other and vast sources of information.

Want more award-winning journalism? Subscribe to Insider Premium.
  • Insider Premium {! insider.prices.premium !}*

    {! insider.display.menuOptionsLabel !}

    Our award winning magazine, unlimited access to our story archive, special discounts to MIT Technology Review Events, and exclusive content.

    See details+

    What's Included

    Bimonthly home delivery and unlimited 24/7 access to MIT Technology Review’s website.

    The Download. Our daily newsletter of what's important in technology and innovation.

    Access to the Magazine archive. Over 24,000 articles going back to 1899 at your fingertips.

    Special Discounts to select partner offerings

    Discount to MIT Technology Review events

    Ad-free web experience

    First Look. Exclusive early access to stories.

    Insider Conversations. Listen in as our editors talk to innovators from around the world.

/
You've read all of your free articles this month. This is your last free article this month. You've read of free articles this month. or  for unlimited online access.