Culturomics and the Google Book Project

The digitization of over five million books has created a huge dataset of cultural interest. Now researchers are beginning to tease it apart using powerful number-crunching techniques.

Emerging Technology from the arXiv

February 27, 2012

Plot the distribution over time of events like earthquakes, forest fires, wars, fashions, epidemics and so on and you’ll get a power law, a mathematical relationship in which frequency depends on some other quantity raised to a certain power.

That’s a puzzling and profound discovery. It means that these real world phenomena occur in patterns that defy standard statistical analysis. It becomes meaningless, for example, to describe the ‘average’ time between forest fires or the ‘average’ size of an earthquake or the number of deaths in an ‘average’ war. The notion of average simply doesn’t make sense.
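To see why the average can stop making sense, here is a minimal Python sketch. It is an illustration only, not something taken from the study: it assumes event sizes follow a Pareto law, a standard textbook power law, and the function names and the tail exponent `alpha` are hypothetical choices. When `alpha` is 1 or less, the theoretical mean is infinite, so the sample mean tends to keep drifting upward as more events are recorded instead of settling down.

```python
import math
import random


def pareto_sample(u, alpha, x_min=1.0):
    """Inverse-transform draw from a Pareto law with P(X > x) = (x_min / x) ** alpha.

    u must lie in [0, 1); alpha is the (assumed) tail exponent.
    """
    return x_min * (1.0 - u) ** (-1.0 / alpha)


def pareto_mean(alpha, x_min=1.0):
    """Theoretical mean, which is finite only when alpha exceeds 1."""
    if alpha <= 1.0:
        return math.inf
    return alpha * x_min / (alpha - 1.0)


def running_means(alpha, sizes, seed=0):
    """Sample mean of the first n draws for each n in sizes (sizes must be increasing)."""
    rng = random.Random(seed)
    total, drawn, means = 0.0, 0, []
    for n in sizes:
        while drawn < n:
            total += pareto_sample(rng.random(), alpha)
            drawn += 1
        means.append(total / n)
    return means


if __name__ == "__main__":
    sizes = [10 ** k for k in range(2, 6)]
    for alpha in (3.0, 1.0):
        print("alpha =", alpha, "theoretical mean =", pareto_mean(alpha))
        for n, mean in zip(sizes, running_means(alpha, sizes)):
            print("  n =", n, "sample mean =", round(mean, 3))
```

With `alpha = 3.0` the sample means should hover near the theoretical value of 1.5. With `alpha = 1.0` there is no finite value for them to converge to, and a single enormous event can shift the average at any point.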

One of the key factors driving work in this area is the availability of large datasets that scientists can crunch to look for power laws. That’s relatively straightforward to get for many natural phenomena such as earthquakes and forest fires, at least in recent history.

But traditionally, social phenomena have been much harder to measure. That’s started to change in the last few years thanks to the vast databases built up since the advent of the internet. The study of the web has revealed no end of power laws in human behaviour and more recently, the data associated with mobile phone use has begun to reveal large scale patterns of human mobility and social intercourse.

Now the Google Book project has triggered a new area of investigation. This program has scanned the contents of some 5 million books from 40 university libraries around the world. That’s about 4 per cent of all the books ever published.

Last year, the Google Books team and a few others published the first study of this database. These guys examined the occurrence of n-grams: sequences of n consecutive words, so a 1-gram is a single word and a 2-gram is a pair of adjacent words.
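As a rough illustration of what counting n-grams involves, here is a short Python sketch. It assumes text is split on whitespace and lowercased, which is much cruder than whatever tokenization the Google team actually used, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in a text, using naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    sample = "the war ended and the war began again"
    print(ngram_counts(sample, 1).most_common(2))
    print(ngram_counts(sample, 2).most_common(1))
```

Repeating a count like this for the books published in each year gives a time series of how often a word or phrase appears, which is the kind of raw material the studies described here work with.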

They concluded that their approach can produce unique insights into fields such as the evolution of grammar, the adoption of technology, the pursuit of fame, censorship and historical epidemiology. They even coined a word to describe this new area of science that spans the humanities and social sciences–culturomics.

Today, Jianbo Gao at Wright State University in Dayton, Ohio, and a few pals, take Google’s idea and run with it. Or at least jog a few steps.

These guys have used the Google books database to study the occurrences of two types of words: those that describe natural phenomena such as earthquakes and hurricanes; and those that describe social phenomena such as war and unemployment.

In particular, they look at a mathematical quantity called the Hurst parameter which describes the likelihood of an event occurring again given that it has occurred in the past. The Hurst parameter is a measure of the way the past influences the future, whether the ‘memory’ of an event is long term or short term in the data.
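By the usual convention, a Hurst parameter of 0.5 means no long term memory, values above 0.5 indicate persistent, long-range correlations, and values below 0.5 indicate anti-persistence. The Python sketch below shows one simple textbook way to estimate it, the aggregated-variance method, and is not necessarily the technique Gao and co use. It assumes the input is a stationary, noise-like series (for example, year-to-year fluctuations in a word's frequency), and all names and block sizes are illustrative choices.

```python
import math
import random


def block_means(series, m):
    """Average the series over consecutive non-overlapping blocks of length m."""
    k = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(k)]


def variance(values):
    """Population variance of a list of numbers."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def hurst_aggregated_variance(series, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate H from the scaling Var(block mean) ~ m ** (2H - 2).

    The series must not be constant, or the log of a zero variance fails.
    """
    log_m = [math.log(m) for m in block_sizes]
    log_var = [math.log(variance(block_means(series, m))) for m in block_sizes]
    return 1.0 + slope(log_m, log_var) / 2.0


if __name__ == "__main__":
    rng = random.Random(1)
    white_noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
    # Uncorrelated noise has no memory, so the estimate should land near 0.5.
    print(round(hurst_aggregated_variance(white_noise), 2))
```

For uncorrelated noise the variance of the block means falls off as 1/m, which gives a slope of -1 and an estimate near 0.5. A series with long term memory averages out more slowly, so the slope is shallower and the estimate rises toward 1.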

In general, they say, words describing natural phenomena seem to appear in a way that has a long term memory while words describing social phenomena have only a short term memory effect.

That’s interesting because it implies that the forces that determine when these words occur are different. “Our analysis suggests…that social phenomena tend to follow different scaling laws than natural phenomena.”

That makes sense and it’s interesting to see the result come out of a single body of data.

But there are numerous complexities that make the result hard to fathom. Gao and co suggest that the use of words is driven by the occurrence of events themselves. So the word earthquake is used more often after a big earthquake, for example.

But there are other social phenomena at work that can mask this effect, such as the way news spreads, the use of censorship, cultural taboos and so on. It may be that these have a much more powerful effect on the use of certain words.

That makes it hard to draw strong conclusions from this study.

But it does make it all the more appealing to carry out more powerful analyses to tease apart the various cultural effects that are at work. Culturomics is clearly a discipline with a future, albeit one that is hard to fathom for the time being.

Ref: arxiv.org/abs/1202.5299: Culturomics Meets Random Fractal Theory: Insights Into Long-Range Correlations Of Social And Natural Phenomena Over The Past Two Centuries
