Who’s Using Your Data?

New Web technology would let you track how your private data is used online.

Larry Hardestyarchive page

August 19, 2014

By now, most people trust the cryptographic schemes that protect online financial transactions, but inadvertent misuse of our data by people authorized to access it remains a pressing concern as more of it moves online.

Tim Berners-Lee, Oshani Seneviratne, SM ’09, and Lalana Kagal collaborated to develop HTTPA.

At the same time, tighter restrictions on access could undermine the whole point of sharing data. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory believe the solution may be more transparency. To that end, they’re developing a protocol they call “HTTP with Accountability,” or HTTPA, which will automatically monitor the transmission of private data and allow the data owner to examine how it’s being used.

With HTTPA, remote access to a Web server would be controlled much the way it is now, through passwords and encryption. But every time the server transmitted a piece of sensitive data, it would also send a description of the restrictions on the data’s use. And it would log the transaction at multiple points across a network of encrypted, special-purpose servers.

“It’s not that difficult to transform an existing website into an HTTPA-aware website,” says Oshani Seneviratne, SM ’09, a graduate student in electrical engineering and computer science who developed the protocol with her advisor, Tim Berners-Lee, and Lalana Kagal, a principal research scientist at CSAIL. “On every HTTP request, the server should say ‘Okay, here are the usage restrictions for this resource’ and log the transaction in the network of special-purpose servers.”

Seneviratne uses a technology known as distributed hash tables—the technology at the heart of peer-to-peer networks like BitTorrent—to distribute the transaction logs among the servers. Redundant storage of the same data on multiple servers ensures that if some servers go down, data will remain accessible. It also provides a way to detect data tampering: a server whose logs differ from those of its peers would be easy to ferret out.

To test the system, Seneviratne built a rudimentary health-care records system from scratch and filled it with data supplied by 25 volunteers. She then simulated a set of data transfers corresponding to events that the volunteers reported as having occurred over the course of a year—pharmacy visits, referrals to specialists, use of anonymized data for research purposes, and the like.

In experiments involving 300 servers on the experimental network PlanetLab, the system efficiently tracked down data stored across the network and handled the chains of inference necessary to audit its propagation among multiple providers. In practice, Seneviratne says, audit servers could be maintained by a grassroots network, much like the servers that host BitTorrent files or log Bitcoin transactions.

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.