The “real-time Web” is a hot concept these days. Both Google and Microsoft are racing to add more real-time information to their search results, and a slew of startups are developing technology to collect and deliver the freshest information from around the Web.
But there’s more to the real-time Web than just microblogging posts, social network updates, and up-to-the-minute news stories. Huge volumes of data are generated, behind the scenes, every time a person watches a video, clicks on an ad, or performs just about any other action online. And if this user-generated data can be processed rapidly, it could provide new ways to tailor the content on a website, in close to real time.
Many Web companies already use analytics to optimize their content throughout the course of a day. Some online news sites will, for example, tweak the layout on their home page by monitoring the popularity of different articles. But traditionally, information has been collected, stored, and then analyzed afterward. Using seconds-old data to tailor content automatically is the next step. In particular, a lot of the information generated in real time relates to advertising. A few startup companies are developing technologies to process this data rapidly.
Sailesh Krishnamurthy, vice president and cofounder of the data-analysis company Truviso, based in Foster City, CA, points to the hundreds of billions of data points created each day through the delivery of online video. “If you think of each one of those hits and the associated advertisements being served by those hits,” he says, “then it’s this complex ecosystem of companies serving the ads, managing the ads, companies trying to figure out metrics. It’s pretty amazing to think that just that one user interaction leads to this explosion of activity happening under the covers.”
Real-time data analysis has its roots in the financial markets, but Ben Lorica, a senior analyst in the research group at O’Reilly Media, believes that Web companies will want to optimize ads, video, and multimedia campaigns as fast as possible. He adds that services that deliver Web content instantly make the approach relevant to end users, too. “As people realize that they can push content out and others will start consuming it in real time, then people will also naturally want the reporting of how that is being consumed in real time,” he says.
Truviso and another startup, StreamBase, based in Lexington, MA, have created software to process real-time analytics data. Both companies were spun out of university research aimed at processing real-time data from sensor networks, such as those used to monitor environmental conditions. Richard Tibbetts, CTO of StreamBase, explains that financial markets make up about 80 percent of his company’s customers today. Web companies are just starting to adopt the technology.
“You’re going to see real-time Web mashups, where data is integrated from multiple sources,” Tibbetts says. Such a mashup could, for example, monitor second-to-second fluctuations in the price of airline tickets and automatically purchase one when it falls below a certain price.
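The ticket-buying mashup Tibbetts describes comes down to a threshold rule applied to a stream of price quotes. A minimal sketch in Python, assuming a hypothetical feed of quote dictionaries and a hypothetical `buy` callback (neither StreamBase's software nor any airline API is represented here), might look like this:

```python
def watch_prices(quotes, threshold, buy):
    """Scan quotes in arrival order and buy the first one below the threshold.

    `quotes` is any iterable of dicts with a "price" key (an assumed format).
    `buy` is a hypothetical callback that would place the order.
    Returns the purchased quote, or None if no quote was cheap enough.
    """
    for quote in quotes:
        if quote["price"] < threshold:
            buy(quote)
            return quote
    return None


# Made-up quotes standing in for a live price feed.
feed = [
    {"flight": "SFO-BOS", "price": 412.0},
    {"flight": "SFO-BOS", "price": 389.0},
    {"flight": "SFO-BOS", "price": 349.0},
    {"flight": "SFO-BOS", "price": 330.0},
]

purchases = []  # stands in for a real purchasing system
bought = watch_prices(feed, threshold=350.0, buy=purchases.append)
print(bought)
```

Because the function stops at the first qualifying quote, it buys once and ignores later, cheaper prices; a real system would also have to handle feeds that never end and purchases that fail.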
Truviso recently launched a feature that allows users to calculate unique visitors to a website in real time. This has historically been a difficult problem because several steps must be performed each time to make sure the user is really distinct. Both StreamBase and Truviso rely on accessing conventional, structured databases. Lorica sees potential for real-time analysis of unstructured data, such as a set of numbers scattered through a paragraph of text rather than formatted in a chart.
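To see why counting unique visitors on the fly takes extra bookkeeping, consider a rough Python sketch of an exact count over a sliding time window. This is not Truviso's method, which the company does not detail here; the class name, the 60-second window, and the visitor IDs are all assumptions made for illustration, and timestamps are assumed to arrive in order.

```python
from collections import deque


class SlidingUniqueVisitors:
    """Exact count of distinct visitor IDs seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, visitor_id) in arrival order
        self.counts = {}       # visitor_id -> hits still inside the window

    def add(self, timestamp, visitor_id):
        # Record the hit, then drop anything that has aged out of the window.
        self.events.append((timestamp, visitor_id))
        self.counts[visitor_id] = self.counts.get(visitor_id, 0) + 1
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_id = self.events.popleft()
            self.counts[old_id] -= 1
            if self.counts[old_id] == 0:
                # No hits from this visitor remain in the window.
                del self.counts[old_id]

    def unique(self):
        return len(self.counts)


uv = SlidingUniqueVisitors(window_seconds=60)
for ts, visitor in [(0, "a"), (10, "b"), (20, "a"), (70, "c")]:
    uv.add(ts, visitor)
print(uv.unique())
```

Every hit has to be checked against the visitors already seen, and old hits have to be retired as the window moves, which is the kind of repeated per-event work that makes the problem hard at Web scale. Memory here grows with the number of hits in the window, so very large sites often trade exactness for approximate counting instead.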
Software frameworks such as Hadoop and Google’s MapReduce, which process large amounts of Web data using large numbers of computers, are often used to analyze unstructured data. Recent research from Yahoo and the University of California, Berkeley, promises to make these frameworks work in real time, too.
Joseph Hellerstein, a UC Berkeley professor of computer science who was involved with this work, explains that the key was to find a way to make Hadoop and MapReduce faster and more interactive without compromising their ability to protect data.
Real-time applications, whether using traditional database technology or Hadoop, stand to become much more sophisticated. “When people say real-time Web today, they have a narrow view of it: consumer applications like Twitter, Facebook, and a little bit of search,” says StreamBase’s Tibbetts.