Statistical Tricks Extract Sensitive Data from Encrypted Communications

Research suggests that surveillance agencies could use statistical tricks to peek through the encryption that protects Web browsing.

Tom Simonite

June 19, 2014

Stung by revelations about mass government surveillance, consumer Web companies are expanding their use of encryption and releasing more details of those protections to reassure wary customers. Earlier this year, for instance, Apple released details of how communications sent via its iMessage service are encrypted.

New research suggests that the U.S. National Security Agency, or any other organization capable of collecting large quantities of Web traffic, could extract private information from encrypted communications by searching for patterns in that data stream. In tests, analysis of encrypted Internet traffic could reveal the health conditions a person was researching online. Similar techniques could glean information about a person's use of iMessage, such as when he or she starts typing or what language a message is written in. The research relies on an approach known as traffic analysis, which involves using statistical techniques to find patterns in encrypted communications.

Researchers at the University of California, Berkeley, and Intel developed a particularly effective version targeted against HTTPS, the form of encryption used to protect websites and visible to Web surfers as a padlock in a browser’s address bar. The technique involves having software visit the websites of interest and using machine-learning algorithms to learn the traffic patterns associated with different pages. The software then searches for those patterns in a victim’s traffic trace.
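To make the idea concrete, here is a minimal Python sketch of that kind of page fingerprinting. It is not the Berkeley team's method: it swaps in a simple nearest-centroid classifier over packet-size histograms, and the page names, packet sizes, and bucket width are all invented for illustration. Encryption hides what a packet says but not how big it is or which way it is traveling, and that is all this toy attacker uses.

```python
from collections import Counter
import math

BUCKET = 200  # bytes per histogram bucket (arbitrary choice for this sketch)

def features(trace):
    """Turn a trace (signed packet sizes: + outgoing, - incoming)
    into a normalized histogram of bucketed sizes."""
    counts = Counter(size // BUCKET for size in trace)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

def distance(a, b):
    """Euclidean distance between two sparse histograms."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def train(labeled_traces):
    """Average the histograms of the traces recorded for each page."""
    model = {}
    for page, traces in labeled_traces.items():
        summed = Counter()
        for trace in traces:
            for bucket, share in features(trace).items():
                summed[bucket] += share
        model[page] = {b: s / len(traces) for b, s in summed.items()}
    return model

def classify(model, trace):
    """Return the page whose learned pattern is closest to the observed trace."""
    observed = features(trace)
    return min(model, key=lambda page: distance(model[page], observed))

# Hypothetical training data: traces gathered by visiting each page ourselves.
training = {
    "/conditions/a": [[300, -1400, -1400, -1400, -600],
                      [310, -1400, -1400, -1380, -650]],
    "/conditions/b": [[300, -1400, -500, 280, -450],
                      [320, -1390, -520, 290, -430]],
}
model = train(training)

# A trace observed from the victim; only sizes and directions are visible.
guess = classify(model, [305, -1400, -1400, -1400, -620])
print(guess)
```

A real attack would have to cope with caching, shared page resources, and many thousands of pages, which is where more sophisticated machine learning comes in.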

The approach proved capable of identifying the pages for specific medical conditions a person was looking at on the Planned Parenthood and Mayo Clinic websites even though both sites encrypt connections with HTTPS. It could also identify what services a person accessed when he or she logged on to financial sites including Wells Fargo and Bank of America. On average, the technique was about 90 percent accurate at identifying Web pages. A paper on the Berkeley research will be presented at the Privacy Enhancing Technologies Symposium in Amsterdam next month.

Traffic analysis would be a useful tool for surveillance by government programs, such as those used by the NSA to collect and analyze encrypted Internet traffic (see “NSA Leak Leaves Crypto Math Intact but Highlights Known Workarounds”). Corporations with access to Internet traffic might also have motivation to use it, says Brad Miller, the PhD candidate at Berkeley who led the research.

“There are very valid use cases of this type of analysis for companies,” he says. For example, an ISP might want to gain information about its customers’ online activity that could be used to target ads, even if those customers have encrypted their browsing or communications. Some ISPs, such as Verizon Wireless, already sell data on their customers’ browsing to third parties for such purposes.

Scott Coull, a researcher with the security company RedJack, says the Berkeley work is the latest in a series of papers showing how traffic analysis could be used against consumers. “When you look at the worst case for this kind of attack, things don’t look very good,” he says.

Coull recently found that traffic analysis can be very effective against messages sent via Apple’s iMessage, which are encrypted from the moment they are sent to the moment they are received. “iMessage is by far the worst thing I’ve seen,” he says. Coull was able to identify when users started or stopped typing, were sending or opening a message, the language a message was written in, and its length, with 96 percent accuracy or higher.

That, combined with the fact that the iMessage protocol transmits a unique identifier for a device, adds up to similar “metadata” to what has been controversially collected by the NSA on U.S. phone calls, says Coull. “If I had the ability to monitor a big chunk of traffic to and from the iMessage servers, I could come up with a social network of whom is messaging whom, and the language they’re using and the approximate size of the messages,” he says.

Coull found that his technique was usually 100 percent effective against messages sent via the popular WhatsApp and Viber mobile messaging services.

Today, few online service providers give much thought to traffic analysis when implementing encryption to protect privacy, says Ashkan Soltani, an independent security researcher who contributed to the Washington Post’s Pulitzer-winning coverage of NSA surveillance. “That concern tends to be focused more in the security community than the actual website operators that implement encryption,” he says.

An operator that did want to defend against traffic analysis would likely find it expensive. Researchers, including Coull and the team at Berkeley, have tested ways to “pad” encrypted data to hide giveaway patterns from traffic analysis, but transferring extra data isn’t free. “It’s incredibly expensive to cover your tracks,” says Coull. He calculates that Apple’s servers would need to transfer three times as much data to mitigate traffic analysis against iMessage.
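A rough Python sketch shows why padding costs bandwidth. It assumes one simple scheme, rounding every message up to a multiple of a fixed block size; the 512-byte block and the message lengths are invented for illustration. It is not the scheme Coull or the Berkeley team tested, and it does not reproduce Coull's threefold estimate.

```python
def padded_length(length, block=512):
    """Round a message length up to the next multiple of `block` bytes."""
    return -(-length // block) * block  # ceiling division

def overhead_ratio(lengths, block=512):
    """Bytes sent with padding divided by bytes sent without it."""
    return sum(padded_length(n, block) for n in lengths) / sum(lengths)

# Hypothetical message lengths in bytes.
lengths = [40, 120, 500, 700]
padded = [padded_length(n) for n in lengths]
ratio = overhead_ratio(lengths)

print(padded)           # the first three messages now look identical on the wire
print(round(ratio, 2))  # how many times more data is transferred
```

Larger blocks hide more, because more messages end up the same size, but they also push the ratio higher.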

How a government agency or company might practically use traffic analysis is unclear, though. Coull says he would need to do research using real traffic data, for example from an ISP, to shed light on that question.
