We Are the Words

A powerful new tool can help quantify the evolution of human culture by capturing how often we use words and phrases.

Stephen Ornes

December 17, 2010

Taking a cue from the techniques of genomics, a team of researchers has devised a tool that delivers quantitative data on how culture changes over time. Genomics research analyzes huge amounts of data to study how genes function and change; the new tool takes a large-scale approach to study the frequency of word usage over time.

The approach makes sense if words are considered a unit of culture, says Erez Lieberman Aiden, one of the project’s leaders. “The genome contains heritable information, passed from generation to generation,” he says. “The words we use, in the books we write, are also passed from generation to generation.”

Lieberman Aiden and Jean-Baptiste Michel, both at Harvard’s Program for Evolutionary Dynamics, led the project, which they’ve dubbed “culturomics”—a portmanteau combining “culture” and “genomics.” The first fruit of their labors was a mammoth database of the words in about 5.2 million books published between 1800 and 2000—roughly four percent of all published books. These came from the Google Books project, whose library contains 15 million books.

In today’s issue of the journal Science, the researchers introduce their project along with some of the first results they’ve derived from the data. In connection with the publication, Google is rolling out an application (at www.culturomics.org) that allows anyone to access and analyze the finished database, which includes 2 billion words and phrases.

The researchers say that by tracking the frequency of word use, social scientists, computer scientists, and mathematicians can observe the emergence and evolution of cultural trends over time. The tool can be used to create timelines of culture, showing spikes and valleys corresponding to heavy and scant use of particular words.
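As a rough illustration of such a timeline, the sketch below computes a word's relative frequency per year (its count divided by all words counted that year) and finds the year in which it peaks. The function name and the per-year counts are invented for the example; they are not figures from the study or part of Google's tool.

```python
def relative_frequency(word, counts_by_year):
    """Return {year: share of that year's words that are `word`}."""
    timeline = {}
    for year in sorted(counts_by_year):
        counts = counts_by_year[year]
        total = sum(counts.values())
        # Normalizing by the yearly total keeps years with many
        # published words from dominating the timeline.
        timeline[year] = counts.get(word, 0) / total if total else 0.0
    return timeline


# Hypothetical counts, made up for illustration only.
counts_by_year = {
    1900: {"telegraph": 30, "radio": 0, "the": 970},
    1930: {"telegraph": 10, "radio": 40, "the": 950},
    1960: {"telegraph": 2, "radio": 48, "the": 950},
}

timeline = relative_frequency("radio", counts_by_year)
peak_year = max(timeline, key=timeline.get)
```

Dividing by the yearly total matters because far more text is published in later years, so raw counts alone would make almost every word appear to rise over time.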

Suppression, for example, leaves a mark on cultural history. German-language books published under Nazi censorship between 1936 and 1944 barely mention certain artists and philosophers whose names were common before and after that period.
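One simple way to put a number on a dip like this is to compare a name's average frequency inside a period with its average outside it. The sketch below does that; the function, the ratio, and the values are assumptions made for illustration, and this is not presented as the measure the researchers used.

```python
def suppression_ratio(timeline, start, end):
    """Mean frequency inside [start, end] divided by mean frequency outside.

    A value well below 1 means the term was used much less often
    during the period than before and after it.
    """
    inside = [f for year, f in timeline.items() if start <= year <= end]
    outside = [f for year, f in timeline.items() if year < start or year > end]
    if not inside or not outside:
        raise ValueError("need data both inside and outside the period")
    mean_outside = sum(outside) / len(outside)
    if mean_outside == 0:
        raise ValueError("term never appears outside the period")
    return (sum(inside) / len(inside)) / mean_outside


# Hypothetical mentions per million words for one name (made-up values).
mentions_per_million = {
    1930: 8.0, 1934: 8.0,
    1938: 1.0, 1942: 1.0,
    1948: 8.0, 1952: 8.0,
}

ratio = suppression_ratio(mentions_per_million, 1936, 1944)
```

With these invented values the ratio comes out at 0.125, meaning the name appears about one-eighth as often during the period as outside it.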

The analyses also identified words that existed in published books but had no home in dictionaries, including “aridification” (the drying out of a region) and “deletable.” Such undocumented words are far from rare: When the researchers totted up all the words in the English lexicon, they counted more than a million, about twice the number in large modern dictionaries. (The Oxford English Dictionary, for example, has fewer than 500,000 entries.)

Lieberman Aiden says he hopes researchers from many disciplines will find new ways to exploit the data. “It’s another tool at the disposal of humanists to gather insight and answer questions about human nature.”

He and Michel began working seriously on the project in 2007. Not all of the books in Google’s digital library are in the public domain, so the researchers had to be careful not to infringe copyright law. In essence, they removed the words from the context of the books—while keeping metadata like publication date intact—and organized the words into an enormous frequency table.
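The idea of a frequency table keyed by publication date can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (single words only, a crude tokenizer, invented sample text), not the researchers' actual pipeline; `build_frequency_table` is a hypothetical name.

```python
import re
from collections import Counter, defaultdict


def build_frequency_table(books):
    """Count words per publication year.

    `books` is an iterable of (publication_year, text) pairs. Only the
    year and the word counts are kept; word order and the surrounding
    sentences are discarded.
    """
    table = defaultdict(Counter)
    for year, text in books:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        table[year].update(words)
    return table


# Invented sample records standing in for scanned books.
books = [
    (1900, "The sea was calm. The sea was wide."),
    (1900, "A calm day."),
    (1950, "The city was loud."),
]

table = build_frequency_table(books)
```

Here `table[1900]["sea"]` is 2, and nothing in the table allows the original sentences to be reconstructed.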

They applied filters to make their data set as accurate as possible, weeding out, for example, books with incorrect publication dates or those whose text was poorly transcribed by optical-character-recognition software. After filtering, they were left with 5,195,769 books containing more than 500 billion words in total. About 72 percent of those are English words.

The intensive computation needed to reduce that data set to a table of word frequencies was distributed across multiple machines at Google and completed quickly.

Jon Kleinberg, a computer scientist at Cornell University, says that word frequency can be a powerful quantitative tool for identifying trends in culture. “Looking at behavior of individual words can often be a strong first indicator of a phenomenon across time,” he says. Scanned materials, however, are just the beginning. Other digital texts provide rich sources for the quantitative study of cultural information. For example, the analysis of Google search terms can reveal what interests people. Or a large-scale study of Facebook updates can serve as a real-time pulse check on the masses.

“We’re seeing things that were never written down before,” he says. “On Twitter or Facebook, millions of people are saying ‘I’m feeling happy’ or ‘I’m feeling sad.’ Until the last 10 years, where could you have found millions of people writing down their feelings?”
