Intelligent Machines

Putting the Web in a Spreadsheet

A new tool can be used to collect, analyze, and visualize large quantities of data.

Vast quantities of data are freely available on the Web, and they can be a treasure trove for many businesses, provided those businesses can figure out how to use the data effectively.

Sort and filter: BigSheets lets users analyze unstructured data from the Web using tools similar to those found in desktop spreadsheet software.

A company can, for example, comb through data from the U.S. Patent and Trademark Office and court records prior to acquiring another company to see if any of its intellectual property is tied up in legal action. In practice, however, going through so much information takes time and effort to orchestrate.

IBM hopes that a new tool, called BigSheets, will help users analyze Web data more easily. The company has developed a test version of the software for the British Library.

“The ability of any user to do their own types of interesting analytics is coming of age,” says Rod Smith, vice president of emerging Internet technologies for IBM.

BigSheets is built on top of another piece of software called Hadoop. This is an open-source platform for processing very large amounts of Web data by splitting up tasks and handing them off to a cluster of different computers. Hadoop is often used to analyze large amounts of unstructured Web data.
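The split-and-combine idea behind Hadoop can be illustrated with a small word count. This is only a sketch in plain Python, not Hadoop's actual API: the function names and sample pages are invented for illustration, and the "workers" here run one after another rather than on separate machines.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(items, n_chunks):
    """Divide the work into roughly equal chunks, one per hypothetical worker."""
    return [items[i::n_chunks] for i in range(n_chunks)]

def map_chunk(pages):
    """Map step: count the words in one chunk of pages, as a single worker would."""
    counts = Counter()
    for text in pages:
        counts.update(text.lower().split())
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial counts into one."""
    return a + b

# Invented sample data standing in for crawled page text.
pages = [
    "big data on the web",
    "web data at scale",
    "the web archive",
]

partials = [map_chunk(chunk) for chunk in split_into_chunks(pages, 2)]
total = reduce(reduce_counts, partials, Counter())
print(total.most_common(3))
```

Because each chunk is counted independently, the map step could in principle be handed to as many machines as are available, with only the small partial counts sent back to be merged.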

BigSheets uses Hadoop to crawl through Web pages, parsing them to extract key terms and other useful data. BigSheets organizes this information in a very large spreadsheet, where users can analyze it using the sort of tools and macros found in desktop spreadsheet software. Unlike ordinary spreadsheet software, however, there’s no limit to the size of a spreadsheet created through BigSheets.
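As a rough illustration of turning a Web page into a spreadsheet row, the sketch below pulls a title, author, and keywords out of a page's HTML using Python's standard library. The column names, the sample page, and the URL are assumptions made for this example; the article does not describe how BigSheets itself parses pages.

```python
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    """Collect the <title> text and selected <meta> tags from one page."""

    def __init__(self):
        super().__init__()
        self.row = {"title": "", "author": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("author", "keywords"):
            self.row[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.row["title"] += data.strip()

def page_to_row(url, html):
    """Return one spreadsheet-style row (a dict) for a single page."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"url": url, **parser.row}

# Invented sample page; a real crawl would fetch this over the network.
sample_html = """<html><head><title>Example page</title>
<meta name="author" content="A. Writer">
<meta name="keywords" content="archive, web">
</head><body><p>Hello</p></body></html>"""

row = page_to_row("http://example.org/", sample_html)
print(row)
```

Each page yields one row with the same columns, so the results from many pages can be stacked into a single table and then sorted or filtered.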

To use BigSheets, a user would point the tool at a set of URLs or a repository of data. Lists of terms can be used to organize the data into rows and tables, and these can be adjusted later.

Smith says that IBM chose the spreadsheet as the model for organizing data because most users are already familiar with such software. If users want to represent the data in more complex ways, the tool will work with an IBM visualization tool called Many Eyes, as well as other visualization software.

Tag team: Visualization tools can be used with BigSheets to find patterns in unstructured data.

BigSheets has “a level of integration that I haven’t seen,” says Ben Lorica, a senior analyst in the research group at the technical publishing company O’Reilly Media. Traditionally, Lorica says, companies have split the functions that BigSheets performs into three separate tasks: Web crawling, data analysis, and visualization. Because BigSheets is built on Hadoop, which is fundamentally designed to work on enormous quantities of data, Lorica says, “scale is not a problem” for BigSheets.

He cautions, however, that BigSheets is at an early stage and needs to be tested with other data. Since the technology is being developed in conjunction with particular partners of IBM, it’s unclear how easy it would be for a company to start using it, he says. Setting up a Hadoop cluster can be a demanding task, he says, and if BigSheets isn’t packaged well, companies may find themselves needing an army of consultants to prepare the way for the tool.

The first test for BigSheets came at the British Library, which has been working since 2004 to create an archive of the roughly eight million UK websites. At regular intervals, the Library takes snapshots of Web pages, converts them to an archival file format, and stores them. But searching and analyzing this data is another challenge, and that’s where BigSheets came in.

In less than eight hours, Smith says, his team took 4.5 terabytes of archive files and processed them using a Hadoop cluster of four machines. With guidance from British Library researchers, the team used BigSheets to extract keywords, author information, and other metadata from these unstructured Web pages. They experimented with term frequency analysis and ran tag clouds and other visualizations.
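Term frequency analysis and tag clouds are closely related: a tag cloud simply draws the most frequent terms at sizes proportional to their counts. The sketch below shows one simple way to compute both. The stopword list, font-size range, and sample documents are invented for illustration and are not taken from the British Library project.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real analyses use much longer ones.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def term_frequencies(documents):
    """Count how often each non-stopword term appears across all documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

def tag_cloud_sizes(counts, top_n=5, min_pt=10, max_pt=40):
    """Map the top terms to font sizes, scaled linearly by frequency."""
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    return {term: min_pt + (max_pt - min_pt) * n / highest for term, n in top}

# Invented sample documents.
docs = [
    "The archive of the UK web",
    "A snapshot of the web archive",
    "Searching the archive",
]

freqs = term_frequencies(docs)
sizes = tag_cloud_sizes(freqs)
print(freqs.most_common(2))
print(sizes)
```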

The British Library researchers were able to adjust the kinds of metadata they were interested in over the course of the first day, focusing more on who had authored pages than they originally intended. Visualizations provided new insights. For example, using a tag cloud, the researchers discovered that the name of British political figure and writer Alastair Campbell was often misspelled as “Alistair,” surfacing large numbers of relevant records that could easily have been overlooked.
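The researchers spotted the variant spelling by eye in a tag cloud. A different, automated way to surface such near-duplicates is to compare names by string similarity, as in the sketch below. The counts are invented, the 0.85 threshold is an arbitrary assumption, and this is not a description of anything BigSheets does.

```python
from collections import Counter
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Return True when two names are nearly identical, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_variants(counts, threshold=0.85):
    """Group names under the most frequent spelling that they closely match."""
    groups = []  # list of (canonical_name, {variant: count})
    for name, n in counts.most_common():
        for canonical, variants in groups:
            if similar(name, canonical, threshold):
                variants[name] = n
                break
        else:
            # No close match found, so this name starts its own group.
            groups.append((name, {name: n}))
    return groups

# Invented counts of author names, for illustration only.
author_counts = Counter({
    "Alastair Campbell": 120,
    "Tony Blair": 80,
    "Alistair Campbell": 45,
})

groups = group_variants(author_counts)
for canonical, variants in groups:
    print(canonical, variants)
```

Grouping against the most frequent spelling first means the common form becomes the label and rarer variants are attached to it, which suits the case described here, where the misspelling is the less common form.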

Eytan Adar, an assistant professor of information and computer science at the University of Michigan, who researches Internet-scale systems, text mining, and visualization, says that the tool could have a big impact. “Although the British Library’s content seems restricted to a few snapshots for each page, this still translates to a ton of data, and simply dumping search results in response to a query isn’t useful,” Adar says.

Adar has designed his own tool, called Zoetrope, for analyzing how Web pages have changed over time. BigSheets brings new insights, he says, by comparing data from many different pages as well as over time. Adar says that effective visualizations are “crucial for letting users quickly understand large collections of data.”

After further testing, IBM hopes to incorporate BigSheets into its existing services and products.
