Culturomics and the Google Book Project

The digitization of over five million books has created a huge dataset of cultural interest. Now researchers are beginning to tease it apart using powerful number-crunching techniques.

Plot the distribution over time of events like earthquakes, forest fires, wars, fashions, epidemics and so on, and you’ll get a power law: a mathematical relationship in which frequency depends on some other quantity raised to a certain power.

That’s a puzzling and profound discovery. It means that these real world phenomena occur in patterns that defy standard statistical analysis. It becomes meaningless, for example, to describe the ‘average’ time between forest fires or the ‘average’ size of an earthquake or the number of deaths in an ‘average’ war. The notion of average simply doesn’t make sense.
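To see why the average can fail, it helps to write the power law down. The sketch below assumes a continuous distribution with exponent \alpha and a lower cutoff x_{\min}; neither symbol appears in the article, and real data will only follow this form approximately.

```latex
% Assumed form: a normalised power law with exponent \alpha > 1 and cutoff x_{\min} > 0
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\qquad x \ge x_{\min}

% The mean is an integral that converges only when \alpha > 2
\langle x \rangle = \int_{x_{\min}}^{\infty} x \, p(x) \, dx =
\begin{cases}
  \dfrac{\alpha - 1}{\alpha - 2} \, x_{\min}, & \alpha > 2 \\[2ex]
  \infty, & 1 < \alpha \le 2
\end{cases}
```

So the average is strictly undefined only when the exponent is 2 or less. For exponents between 2 and 3 the mean exists but the variance is infinite, which still makes a sample average a poor summary of the data.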

One of the key factors driving work in this area is the availability of large datasets that scientists can crunch to look for power laws. Such data is relatively straightforward to get for many natural phenomena such as earthquakes and forest fires, at least in recent history.
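As a rough illustration of what crunching a dataset for a power law can involve, here is a minimal Python sketch. It generates synthetic event sizes from an assumed power law and then recovers the exponent with the standard maximum-likelihood formula. The data, the exponent of 2.5, the cutoff of 1.0 and the function names are all made up for the example; none of them come from the studies discussed here.

```python
import math
import random


def sample_power_law(alpha, x_min, size, rng):
    """Draw from p(x) ~ x**(-alpha), x >= x_min, by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so the base is never zero.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(size)]


def estimate_exponent(values, x_min):
    """Maximum-likelihood estimate: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Hypothetical "event sizes": synthetic data, not real earthquake or fire records.
rng = random.Random(42)
events = sample_power_law(alpha=2.5, x_min=1.0, size=20000, rng=rng)
alpha_hat = estimate_exponent(events, x_min=1.0)
print(f"estimated exponent: {alpha_hat:.2f}")
```

With real data the hard part is choosing the cutoff and checking that a power law fits better than the alternatives, which this sketch does not attempt.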

But traditionally, social phenomena have been much harder to measure. That’s started to change in the last few years thanks to the vast databases built up since the advent of the internet. The study of the web has revealed no end of power laws in human behaviour and more recently, the data associated with mobile phone use has begun to reveal large scale patterns of human mobility and social intercourse. 

Now the Google Book project has triggered a new area of investigation. This program has scanned the contents of some 5 million books from 40 university libraries around the world. That’s about 4 per cent of all the books ever published.

Last year, the Google books team and a few others published the first study of this database. These guys examined the occurrence of n-grams: sequences of n consecutive words.
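For readers who haven’t met n-grams before, here is a minimal Python sketch of how they can be counted. The sample sentence and the function name are invented for illustration, and splitting on whitespace is a simplification; it is not a description of how the Google books team tokenised its corpus.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical sample text, standing in for the contents of a scanned book.
sample = "the war to end all war"
unigram_counts = Counter(ngrams(sample, 1))  # single words (n = 1)
bigram_counts = Counter(ngrams(sample, 2))   # word pairs (n = 2)

print(unigram_counts[("war",)])
print(bigram_counts[("the", "war")])
```

Counting the same n-gram year by year across millions of books is what turns a library into a time series.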

They concluded that their approach can produce unique insights into fields such as the evolution of grammar, the adoption of technology, the pursuit of fame, censorship and historical epidemiology. They even coined a word to describe this new area of science that spans the humanities and social sciences: culturomics.

Today, Jianbo Gao at Wright State University in Dayton, Ohio, and a few pals take Google’s idea and run with it. Or at least jog a few steps.

These guys have used the Google books database to study the occurrences of two types of words: those that describe natural phenomena such as earthquakes and hurricanes; and those that describe social phenomena such as war and unemployment.

In particular, they look at a mathematical quantity called the Hurst parameter, which describes the likelihood of an event occurring again given that it has occurred in the past. In other words, the Hurst parameter measures the way the past influences the future: whether the ‘memory’ of an event in the data is long term or short term.
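To give a feel for how such a number can be extracted from a time series, here is a minimal Python sketch using classic rescaled-range (R/S) analysis. This is a stand-in chosen for simplicity: the article does not say which estimator Gao and co use, and the series, window sizes and function names below are assumptions made for the example. A value near 0.5 indicates no memory, while values approaching 1 indicate strong persistence.

```python
import math
import random


def rescaled_range(chunk):
    """R/S for one window: range of cumulative deviations divided by the std dev."""
    mean = sum(chunk) / len(chunk)
    running, lo, hi, sq = 0.0, 0.0, 0.0, 0.0
    for x in chunk:
        running += x - mean
        lo, hi = min(lo, running), max(hi, running)
        sq += (x - mean) ** 2
    std = math.sqrt(sq / len(chunk))
    return (hi - lo) / std if std > 0 else None


def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128)):
    """Estimate H as the least-squares slope of log(R/S) against log(window size)."""
    xs, ys = [], []
    for n in window_sizes:
        # Non-overlapping windows of length n.
        chunks = [series[i:i + n] for i in range(0, len(series) - n + 1, n)]
        values = [v for v in map(rescaled_range, chunks) if v is not None]
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope_num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope_den = sum((x - mx) ** 2 for x in xs)
    return slope_num / slope_den


# Synthetic series, not word-frequency data: memoryless noise and its running sum.
rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(2048)]
walk, total = [], 0.0
for step in noise:
    total += step
    walk.append(total)

h_noise = hurst_rs(noise)  # memoryless series
h_walk = hurst_rs(walk)    # strongly persistent series
print(f"noise: {h_noise:.2f}, walk: {h_walk:.2f}")
```

R/S estimates are known to be biased upwards on short series, so the noise value tends to land somewhat above 0.5 rather than exactly on it.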

In general, they say, words describing natural phenomena seem to appear in a way that has a long term memory while words describing social phenomena have only a short term memory effect. 

That’s interesting because it implies that the forces that determine when these words occur are different. “Our analysis suggests…that social phenomena tend to follow different scaling laws than natural phenomena.”

That makes sense and it’s interesting to see the result come out of a single body of data. 

But there are numerous complexities that make the result hard to fathom. Gao and co suggest that the use of words is driven by the occurrence of events themselves. So the word earthquake is used more often after a big earthquake, for example.

But there are other social phenomena at work that can mask this effect, such as the way news spreads, the use of censorship, cultural taboos and so on. It may be that these have a much more powerful effect on the use of certain words. 

That makes it hard to draw strong conclusions from this study.

But it does make it all the more appealing to carry out more powerful analyses to tease apart the various cultural effects that are at work. Culturomics is clearly a discipline with a future, albeit one that is hard to fathom for the time being.

Ref: arxiv.org/abs/1202.5299: Culturomics Meets Random Fractal Theory: Insights Into Long-Range Correlations Of Social And Natural Phenomena Over The Past Two Centuries
