How to Delete Regrettable Posts from the Internet

It’s possible—though not always foolproof—to get embarrassing things taken down. Voluntary data-labeling standards could make it even easier.

Simson Garfinkel ’87, PhD ’05

October 30, 2012

It might seem that the Internet doesn’t lose track of anything that has been published online. The alleged permanence of tweets, blogs, snapshots, and instant messages worries many privacy activists and policymakers such as Viviane Reding, justice commissioner of the European Union and vice president of the European Commission. She has proposed that Europe adopt a “right to be forgotten,” a proposal that is now working its way through the EU legal process and could be law within two years.

Reding’s proposal would grant EU citizens the right to withdraw their consent from online information services after the fact—allowing people to redact embarrassing things from the global information commons, even after the data had been copied to other websites. It’s a controversial proposal: George Washington University law professor Jeffrey Rosen wrote in the Stanford Law Review that such a right could have deeply negative implications for both free speech and journalism and could ultimately fragment the Internet. Rosen pointed out that companies like Google would need to suppress from European search queries information that had been deemed “forgotten” on the continent, even as such information would still be perfectly allowable in the United States.

The proposal might also be unnecessary. Even without a right to be forgotten, there are still many ways that information can be removed from the Web. Such methods could be made more widespread.

Somewhat surprisingly, the easiest information to remove from the Internet may be data stored in Facebook, and to a lesser extent in other social networks. Facebook’s “Statement of Rights and Responsibilities” says that any information a Facebook user uploads to the social network remains that user’s property—posting, liking, and otherwise interacting with Facebook merely gives the service a revocable license to the data. That license ends when the data are deleted.

Wiping away those embarrassing self-portraits you took and posted when you were drunk won’t delete the copies that your friends have saved on their own hard drives. But who makes copies of photos anymore? Here’s a way that the convenience of cloud-based services works in favor of privacy controls: they give you one-stop-shopping for information oblivion, a single place to go and get something deleted.

Facebook was created to make it easy for people to share their personal data, and as a result, people often share information without even realizing it. But Facebook also makes it easy to clean up after yourself. If you put your phone number in your profile, that number might get copied to your friends’ cell phones through Facebook’s application programming interface (API). But if you delete your phone number from your Facebook profile, that same API should go through your friends’ phones and remove your information as well. That’s because Facebook’s developer guidelines prohibit programs that access Facebook from making permanent copies of your personal information: software is only allowed to make a “cache” copy in order to improve performance, but that copy must be linked back so that it can be kept up to date. Such license terms, designed to keep developers dependent on Facebook, have the side effect of enforcing a privacy policy that’s surprisingly pro-consumer.
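To make the “cache copy that must be linked back” idea concrete, here is a minimal sketch in Python. It is not Facebook’s API: the class names ProfileSource and ContactCache, the user ID, and the phone number are all hypothetical, invented only to show how a copy that keeps checking its source ends up honoring a deletion.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProfileSource:
    """Hypothetical stand-in for the social network's authoritative data."""
    phones: Dict[str, str] = field(default_factory=dict)

    def get_phone(self, user_id: str) -> Optional[str]:
        # Returns None once the user has deleted the number.
        return self.phones.get(user_id)


@dataclass
class ContactCache:
    """Hypothetical stand-in for a friend's phone: a cache tied to the source."""
    source: ProfileSource
    cached: Dict[str, str] = field(default_factory=dict)

    def add(self, user_id: str) -> None:
        phone = self.source.get_phone(user_id)
        if phone is not None:
            self.cached[user_id] = phone

    def refresh(self) -> None:
        # Re-check every cached entry; drop anything the source no longer has.
        for user_id in list(self.cached):
            phone = self.source.get_phone(user_id)
            if phone is None:
                del self.cached[user_id]
            else:
                self.cached[user_id] = phone


source = ProfileSource({"alice": "555-0100"})
cache = ContactCache(source)
cache.add("alice")            # the number is copied to the friend's phone

del source.phones["alice"]    # Alice deletes the number from her profile
cache.refresh()               # the linked copy catches up with the deletion
print(cache.cached)           # {}
```

The design choice that matters is that the cache never treats its copy as final. Because every refresh defers to the source, a deletion at the source eventually becomes a deletion everywhere the rule is followed.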

It’s not necessarily difficult to have information removed from Twitter, either. Even though the company’s privacy policy warns “what you say on Twitter may be viewed all around the world instantly,” Twitter lets users delete their own tweets. You can delete other people’s tweets if you are willing to swear out a complaint that the tweets violate the Digital Millennium Copyright Act—that’s how big media companies get Twitter to take down links pointed at copyrighted material. (In a nod to transparency, Twitter makes those requests public on the Chilling Effects website at http://chillingeffects.org/twitter.) Most of us don’t have copyright claims to make, but Twitter will also take down tweets that contain harassing or private information, including credit card numbers, Social Security numbers, addresses, phone numbers, and e-mail addresses. Although it’s possible that someone has made a copy, in many cases removing information effectively sends it down the memory hole.

Yahoo and other websites have similar forms for requesting that information be taken down. They do this even though they generally are not required to by U.S. law. Advertising-funded websites make so little money off any individual piece of data that it’s much easier to take information down than to spend time fighting for the rights of the person who posted the data.

Back in 2005, I met a person who had been the victim of horrible harassment a few years earlier, in high school. Even years later, this colleague of mine was still haunted by a series of harassing websites that her tormentors had put up on free Web-hosting services. My colleague was too traumatized to deal with the issue, so I sent a few e-mails to the Web-hosting companies, and within a few days the offending material had been taken down. Today a search for the person’s name yields only professional results, not those teenage pranks.

Unfortunately, wiping data away from every cranny of the Internet can be challenging. Consider my colleague. If you know where to look, it’s still possible to find those harassing pages. They don’t show up in Google or Bing, but there are copies hidden away at the Internet Archive, a website that seeks to preserve most of the Internet’s content for posterity. There are procedures for removing data from the Internet Archive, but those procedures generally require the active participation of the current holder of the Web domain. Fortunately for my colleague, the Internet Archive’s pages aren’t indexed by Google or Bing, so except for those people who know specifically where to look, the information is invisible.

In fact, it’s hard to imagine a system that could index all of the world’s information thoroughly enough to allow someone exercising the “right to be forgotten” to track down and eradicate every regrettable message or photo. More likely, the mechanisms to find that data would cause more privacy violations than they would prevent.

A better solution could be a set of standards for labeling the provenance of information on the Internet. It would be somewhat like the way Facebook requires application developers to keep checking back to see whether personal information is still acceptable to use. It would also take advantage of the privacy-protecting steps that other sites like Twitter and Yahoo sometimes are willing to take for their users.

This could be done using the HTML microdata standard now under development. Though still evolving, this standard will expand the ways that information in Web pages can be represented in their underlying HTML code. For example, the microdata could include tags designed to facilitate privacy tracking and the retraction of privacy-sensitive information. So if you persuaded a website to take down information because it violated the site’s terms of service, that website could automatically notify others that had made copies of your information, informing them that the license to use the data has been revoked.
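The notification step described above can be sketched in a few lines of Python. This models only the logic of revocation, under loud assumptions: the microdata standard does not define any such retraction tags, and OriginSite, CopySite, the item ID, and the example domain names are hypothetical, invented for illustration.

```python
from typing import Callable, Dict, List


class OriginSite:
    """Hypothetical site where an item is first published."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.items: Dict[str, str] = {}
        # item_id -> callbacks registered by sites that hold a copy
        self.copy_holders: Dict[str, List[Callable[[str], None]]] = {}

    def publish(self, item_id: str, content: str) -> None:
        self.items[item_id] = content
        self.copy_holders.setdefault(item_id, [])

    def register_copy(self, item_id: str, on_revoke: Callable[[str], None]) -> None:
        self.copy_holders[item_id].append(on_revoke)

    def take_down(self, item_id: str) -> None:
        # Remove the original, then tell every registered copy holder.
        self.items.pop(item_id, None)
        for notify in self.copy_holders.pop(item_id, []):
            notify(item_id)


class CopySite:
    """Hypothetical site that republishes items and honors revocations."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.copies: Dict[str, str] = {}

    def copy_from(self, origin: OriginSite, item_id: str) -> None:
        self.copies[item_id] = origin.items[item_id]
        origin.register_copy(item_id, self.revoke)

    def revoke(self, item_id: str) -> None:
        self.copies.pop(item_id, None)


origin = OriginSite("origin.example")
mirror = CopySite("mirror.example")
origin.publish("post-1", "an embarrassing photo caption")
mirror.copy_from(origin, "post-1")

origin.take_down("post-1")    # the license to use the data is revoked
print(mirror.copies)          # {}
```

The sketch works only because the copying site registers itself when it copies and chooses to act on the notice, which is why such a scheme would be voluntary rather than enforced.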

Such voluntary technical measures would go a long way toward improving the situation that policymakers hope to fix with a legal right to be forgotten.
