We Are the Words

A powerful new tool can help quantify the evolution of human culture by capturing how often we use words and phrases.
December 17, 2010

Taking a cue from the techniques of genomics, a team of researchers has devised a tool that delivers quantitative data on how culture changes over time. Genomics research analyzes huge amounts of data to study how genes function and change; the new tool takes a large-scale approach to study the frequency of word usage over time.

The approach makes sense if words are considered a unit of culture, says Erez Lieberman Aiden, one of the project’s leaders. “The genome contains heritable information, passed from generation to generation,” he says. “The words we use, in the books we write, are also passed from generation to generation.”

Lieberman Aiden and Jean-Baptiste Michel, both at Harvard’s Program for Evolutionary Dynamics, led the project, which they’ve dubbed “culturomics”—a portmanteau combining “culture” and “genomics.” The first fruit of their labors was a mammoth database of the words in about 5.2 million books published between 1800 and 2000—roughly four percent of all published books. These came from the Google Books project, whose library contains 15 million books.

In today’s issue of the journal Science, the researchers introduce their project along with some of the first results they’ve derived from the data. In connection with the publication, Google is rolling out an application (at www.culturomics.org) that allows anyone to access and analyze the finished database, which includes 2 billion words and phrases.

The researchers say that by tracking the frequency of word use, social scientists, computer scientists, and mathematicians can observe the emergence and evolution of cultural trends over time. The tool can be used to create timelines of culture, showing spikes and valleys corresponding to heavy and scant use of particular words.
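A timeline of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the `relative_frequency` function and the toy corpus are hypothetical, and the sketch handles only single words, not phrases. It divides the number of times a word appears in a given year by the total number of words published that year, so that years with more books do not automatically look like spikes.

```python
from collections import Counter


def relative_frequency(books, word):
    """Return {year: share of that year's words that equal `word`}.

    `books` is an iterable of (publication_year, text) pairs.
    Tokenization here is a naive lowercase whitespace split (an assumption).
    """
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all words per year
    for year, text in books:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus, invented for illustration only.
toy_books = [
    (1900, "the telegraph carried the news"),
    (1900, "news by telegraph"),
    (1950, "the radio carried the news"),
]
timeline = relative_frequency(toy_books, "telegraph")
```

Plotting such a dictionary year by year would give the spikes and valleys the researchers describe.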

Suppression, for example, leaves a mark on cultural history. German-language books published under Nazi censorship between 1936 and 1944 barely mention certain artists and philosophers whose names were common before and after that period.

The analyses also identified words that existed in published books but had no home in dictionaries, including “aridification” (the drying out of a region) and “deletable.” These untethered words are no exception: When the researchers totted up all the words in the English lexicon, they counted more than a million—twice the number in large modern dictionaries. (The Oxford English Dictionary, for example, has fewer than 500,000 entries.)

Lieberman Aiden says he hopes researchers from many disciplines will find new ways to exploit the data. “It’s another tool at the disposal of humanists to gather insight and answer questions about human nature.”

He and Michel began working seriously on the project in 2007. Not all of the books in Google’s digital library are in the public domain, so the researchers had to be careful not to infringe copyright law. In essence, they removed the words from the context of the books—while keeping metadata like publication date intact—and organized the words into an enormous frequency table.
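The idea of stripping words from their context while keeping the publication date can be illustrated with a short sketch. It is an assumption-laden toy, not the project's actual pipeline: the `build_frequency_table` name, the record layout, and the simple regular-expression tokenizer are all hypothetical. The point is that the resulting table stores only counts per word and year, so the original sentences cannot be read back out of it.

```python
import re
from collections import Counter


def build_frequency_table(books):
    """Count occurrences of each (word, year) pair.

    `books` is an iterable of dicts with "year" and "text" keys
    (a hypothetical record layout). Word order is discarded; only
    the counts and the publication year survive.
    """
    table = Counter()
    for book in books:
        year = book["year"]
        # Naive tokenizer (an assumption): runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", book["text"].lower()):
            table[(word, year)] += 1
    return table


# Invented sample records, for illustration only.
toy_library = [
    {"year": 1851, "text": "The ship sailed. The crew sang."},
    {"year": 1851, "text": "The storm passed."},
    {"year": 1925, "text": "The party ended."},
]
table = build_frequency_table(toy_library)
```

Looking up `table[("the", 1851)]` then answers how often a word appeared in a given year without exposing any book's text.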

They applied filters to make their data set as accurate as possible, weeding out, for example, books with incorrect publication dates or those whose text was poorly transcribed by optical-character-recognition software. After filtering, they were left with 5,195,769 books containing more than 500 billion words in total. About 72 percent of those are English words.

The intensive computations required to narrow that data set into one based on the frequency of each word were distributed over multiple machines at Google and completed quickly.

Jon Kleinberg, a computer scientist at Cornell University, says that word frequency can be a powerful quantitative tool for identifying trends in culture. “Looking at behavior of individual words can often be a strong first indicator of a phenomenon across time,” he says. Scanned materials, however, are just the beginning. Other digital texts provide rich sources for the quantitative study of cultural information. For example, an analysis of Google search terms can reveal what interests people, and a large-scale study of Facebook updates could serve as a real-time pulse check on the masses.

“We’re seeing things that were never written down before,” he says. “On Twitter or Facebook, millions of people are saying ‘I’m feeling happy’ or ‘I’m feeling sad.’ Until the last 10 years, where could you have found millions of people writing down their feelings?”
