Tag team: Visualization tools can be used with BigSheets to find patterns in unstructured data.
IBM
BigSheets has "a level of integration that I haven't seen," says Ben Lorica, a senior analyst in the research group at the technical publishing company O'Reilly Media. Traditionally, Lorica says, companies have split the functions that BigSheets performs into three separate tasks--Web crawling, data analysis, and visualizations. Because BigSheets is built on Hadoop, which is fundamentally designed to work on enormous quantities of data, Lorica says, "scale is not a problem" for BigSheets.
He cautions, however, that BigSheets is at an early stage and needs to be tested with other data. Since the technology is being developed in conjunction with particular partners of IBM, it's unclear how easy it would be for a company to start using it, he says. Setting up a Hadoop cluster can be a demanding task, he says, and if BigSheets isn't packaged well, companies may find themselves needing an army of consultants to prepare the way for the tool.
The first test for BigSheets came at the British Library, which has been working since 2004 to create an archive of the roughly eight million UK websites. At regular intervals, the Library takes snapshots of Web pages, converts them to an archival file format, and stores them. But searching and analyzing this data is another challenge, and that's where BigSheets came in.
In less than eight hours, Smith says, his team took 4.5 terabytes of archive files and processed them using a Hadoop cluster of four machines. With guidance from British Library researchers, the team used BigSheets to extract keywords, author information, and other metadata from these unstructured Web pages. They experimented with term frequency analysis and ran tag clouds and other visualizations.
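As a rough illustration of the term frequency analysis mentioned above, here is a minimal sketch in plain Python. It is not how BigSheets does it: BigSheets runs on a Hadoop cluster, and the article does not describe its internals. The sample pages, the stop-word list, and the function name `term_frequencies` are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample pages; the real archive files are not shown in the article.
PAGES = [
    "Election results: the party conference opened today.",
    "Party leaders debated the election in London.",
]

# A tiny illustrative stop-word list (an assumption, not from the article).
STOP_WORDS = {"the", "in", "a", "an", "of", "and", "to"}


def term_frequencies(pages):
    """Count how often each non-stop-word term appears across all pages."""
    counts = Counter()
    for text in pages:
        # Lowercase the text and pull out runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


if __name__ == "__main__":
    # Print the three most frequent terms with their counts.
    for term, count in term_frequencies(PAGES).most_common(3):
        print(term, count)
```

Counts like these are the usual input to a tag cloud, where a term's display size grows with its frequency.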
The British Library researchers were able to adjust the kinds of metadata they were interested in over the course of the first day, focusing more on who had authored pages than they originally intended. Visualizations provided new insights. For example, using a tag cloud, the researchers discovered that the name of British political figure and writer Alastair Campbell was often misspelled as "Alistair," surfacing large numbers of relevant records that could easily have been overlooked.
Eytan Adar, an assistant professor of information and computer science at the University of Michigan, who researches Internet-scale systems, text mining, and visualization, says that the tool could have a big impact. "Although the British Library's content seems restricted to a few snapshots for each page, this still translates to a ton of data, and simply dumping search results in response to a query isn't useful," Adar says.
Adar has designed his own tool, called Zoetrope, for analyzing how Web pages have changed over time. BigSheets brings new insights, he says, by comparing data from many different pages as well as over time. Adar says that effective visualizations are "crucial for letting users quickly understand large collections of data."
After further testing, IBM hopes to incorporate BigSheets into its existing services and products.
carlhage
84 Comments
Garbage in Spreadsheet Out?
The concept of using spreadsheets with the web sounded intriguing to me. It's something I really need (see below). But I went to the IBM demo and got the glitzy 3D sphere of floating words. I see "Party" (the words were about the election), so I click that. Then I get "party party party" as the first of "Your search returned 5005080 instances". The demo seems to show a big step backwards. (Real search engines actually manage to give meaningful results to this search, though it's not easy to give it a context like politics or elections.)
There is lots of really valuable information available on the net, much of which fits in a spreadsheet, but it is now only accessible via simplistic Google-type word searches. For example, last weekend I was reading Google.org's energy plan, which has lots of references. Earlier, I was doing some calculations and trying to find various statistics and physical constants. The net has all the data you need-- cost for generating electricity for different types (capital and fuel), per capita electric use, etc., but the only way to find it is via text search, then tedious copy and paste.
I wanted to find the latent heat of salt, e.g. for solar thermal storage. Wikipedia is great-- it has a table on the latent heat page, and a table on CO2 emissions by fuel type, but there isn't a good way to extract this "knowledge" into something usable as would be a spreadsheet. After an absurd amount of time, I think I gave up trying to find that physical constant. (I just thought of looking for that in my CRC handbook, but I didn't find it there either. But the net could be like a superpowered CRC Handbook, since the data is there, but probably not where today's search engines can find it.)
Likewise, the Wikipedia page on Sodium Chloride has lots of nice physical constants (e.g. molecular weight and boiling point, but not latent heat). A search engine could, in theory, analyze Wikipedia and form a spreadsheet extracted from page content for various materials.
The US EIA has lots of tables of energy statistics, many published as spreadsheets. But I don't know of any internet search tools that can operate on this data with any kind of semantic context, e.g. what is the per-household energy usage by state? Sometimes statistics are available but the units need conversion or multiplication by physical constants is needed. A web-in-a-spreadsheet could identify all this net-based information and make it usable like spreadsheet tables.
There are sites like statemaster.com and nationmaster.com that repackage public domain statistics for certain kinds of data, but that's not the same as a general search tool with some concept of semantics, or a tool that can manipulate across tables as does the units conversion software.
If I were doing a research project on net searching, one thing I might consider is to build tool sets that could identify tabular information in web-crawled documents, then find the semantics from table headings, etc. to construct searchable data sets.
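To make that idea concrete, here is a minimal sketch using only Python's standard `html.parser` module. It finds `<table>` elements in a page and uses the first row as column headings to build records that can be searched. The class name `TableExtractor`, the helper functions, and the sample HTML (with placeholder values, not real data) are all hypothetical. Real pages would also need handling for nested tables, `rowspan`/`colspan`, and tables used only for layout, none of which this sketch attempts.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the rows of every <table> as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_records(rows):
    """Treat the first row as headings and map each later row onto them."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_records(html):
    """Return one list of heading-keyed records per non-empty table."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table_to_records(t) for t in parser.tables if t]


# Hypothetical page fragment; the values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Material</th><th>Property A</th></tr>
  <tr><td>material-1</td><td>10</td></tr>
  <tr><td>material-2</td><td>20</td></tr>
</table>
"""

if __name__ == "__main__":
    for table in extract_records(SAMPLE_HTML):
        for record in table:
            print(record)
```

Keying each row by its column heading is the simplest possible stand-in for "semantics from table headings". Working out that two differently worded headings mean the same quantity, or what units a column uses, is the hard part and is left open here.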
Google search has a built-in calculator that can convert units, etc. Imagine a next step where it could know about physical constants and statistics. Wow!
Ultimately, what is needed is a way for authors to tag the semantics of the data they present, and encouragement for authors to tag tables and spreadsheets with these semantics. Besides applications like science, it could also be used in commerce, e.g. to tag prices and specifications for products in web catalogs. But short of this kind of tagging, perhaps there are some auto-interactive approaches.
My hope is that this rant will prompt someone to attempt some real semantic kind of net data mining, not just a glitzy animated ball of a handful of search words. [For glitz, see the movie Disclosure, where Michael Douglas had to don virtual reality glasses to access the database. It's hard to beat Hollywood when it comes to glitz in UIs.]
ms
190 Comments
Re: Garbage in Spreadsheet Out?
See http://en.wikipedia.org/wiki/Semantic_Web