Cultural genome project mines Google Books for the secret history of humanity

What exactly is the half-life of celebrity? A new field called culturomics has the answer. Using the largest linguistic database ever created - Google Books - culturomics experts track things like "lexical dark matter," and how long fame really lasts.

Earlier today, we spoke to the two main minds behind the cultural genome, Harvard researchers Jean-Baptiste Michel and Erez Aiden. Both come from multidisciplinary backgrounds - they're both part of the Program for Evolutionary Dynamics, but Michel is also a member of the psychology and systems biology departments, and Aiden is also part of the mathematics department and school of engineering, just to name a few. Indeed, the entire research team encompasses a huge variety of backgrounds, including several more Harvard departments, Google, and Encyclopedia Britannica.

The cultural genome

We asked them how they got to work on what they're calling the "cultural genome", a massive database that acts as a digitized record of human culture through time. They explained that they wanted to track the evolution of culture quantitatively. There has been work done in this field before, particularly on how irregular verbs transform over time and what that can reveal about the subtleties of cultural change. This, however, was small-scale, time-consuming work that could only be done by manually going through books to track the changes. This was, they admitted, a big pain in the neck. There had to be a better way.

That's where Google entered the picture. Through Google Books, the search engine giant has been digitizing huge swaths of books, including older books that practically nobody has looked at in over a century. Sensing an opportunity for something major, the researchers asked Google for access to its database for the purpose of scientific research. Google quickly got on board, and so now Michel, Aiden, and their co-researchers could automatically track how words change over time in the largest database of the written word ever assembled. That's how the cultural genome was created.

The sheer scale of the enterprise is hard to imagine. This cultural genome is many thousands of times larger than any previous corpus or database, and it includes 4% of everything ever published. There are a thousand times more letters in the cultural genome than there are DNA base pairs in the human genome. Writing the entire corpus out in a single line would reach to the Moon and back ten times over. It would take eighty years to read just the works from the year 2000, and that's assuming you never stopped to eat, drink, or sleep.

How many grams of words do you have?

So with all that data at their disposal, what did they discover? In order to manage the data, they looked at words and phrases as n-grams. Any single word was a 1-gram, a two-word phrase like "Wall Street" or "blue whale" was a 2-gram, a three-word phrase like "Los Angeles Lakers" or "laws of robotics" was a 3-gram, and so on. They restricted the study to 1-grams through 5-grams, and then looked for any n-grams that appeared more than forty times in the entire corpus.
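To make the n-gram idea concrete, here is a minimal sketch of how such counts could be produced. It is not the researchers' actual pipeline: the function names are hypothetical, and splitting text on whitespace is a simplifying assumption that ignores punctuation and capitalization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=5):
    """Count all 1-grams through max_n-grams in a text."""
    tokens = text.split()  # assumption: naive whitespace tokenization
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def frequent(counts, min_count=40):
    """Keep only n-grams that appear more than min_count times."""
    return {gram: c for gram, c in counts.items() if c > min_count}


sample = "the laws of robotics are the laws of robotics"
counts = count_ngrams(sample)
print(counts[("laws", "of", "robotics")])  # how often this 3-gram occurs
```

On a real corpus, the threshold of forty described above would then be applied with frequent(counts) to discard the rarest n-grams.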

The cultural genome is a powerful way to understand how our use of words changes over time, and the ways in which those changes happen. As they explain in their paper:

Usage frequency is computed by dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year. For instance, in 1861, the 1-gram "slavery" appeared in the corpus 21,460 times, on 11,687 pages of 1,208 books. The corpus contains 386,434,758 words from 1861; thus the frequency is 5.5x10^-5. "slavery" peaked during the civil war (early 1860s) and then again during the civil rights movement (1955-1968).

In contrast, we compare the frequency of "the Great War" to the frequencies of "World War I" and "World War II." "the Great War" peaks between 1915 and 1941. But although its frequency drops thereafter, interest in the underlying events had not disappeared; instead, they are referred to as "World War I."

These examples highlight two central factors that contribute to culturomic trends. Cultural change guides the concepts we discuss (such as "slavery"). Linguistic change – which, of course, has cultural roots – affects the words we use for those concepts ("the Great War" vs. "World War I").
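The frequency calculation quoted above is simple enough to check by hand. The short sketch below reruns it with the 1861 "slavery" figures from the paper; the function name is our own, chosen for illustration.

```python
def usage_frequency(ngram_count, total_words):
    """Instances of an n-gram in a year divided by all words from that year."""
    return ngram_count / total_words


# Figures quoted from the paper for the 1-gram "slavery" in 1861.
slavery_1861 = usage_frequency(21_460, 386_434_758)
print(f"{slavery_1861:.2e}")
```

The result comes out at a little over 5.5 occurrences per 100,000 words, in line with the 5.5x10^-5 figure the paper reports.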

The English language boom

These aren't exactly new concepts, but the cultural genome allows us to look at things we already know in new ways. It also reveals just how much we don't know. Allowing for numbers, misspellings and foreign words, the researchers estimated there were 544,000 words in the English lexicon in 1900, 597,000 in 1950, and 1,022,000 in 2000. These days, 8,500 completely new words enter the lexicon every year, fueling a 70% growth in the size of our language over the last fifty years.
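Those figures hang together, as a quick back-of-the-envelope check shows. The sketch below uses only the three lexicon estimates given above; the variable names are ours.

```python
# Estimated size of the English lexicon, as reported above.
lexicon = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}

added = lexicon[2000] - lexicon[1950]  # words gained between 1950 and 2000
growth = added / lexicon[1950]         # fractional growth over those 50 years
per_year = added / 50                  # average number of new words per year

print(added, f"{growth:.0%}", per_year)
```

The gain of 425,000 words over fifty years works out to an average of 8,500 a year and growth of roughly 70%, matching the numbers quoted.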

As you can see, the English language is in a period of booming expansion, but all three years reveal a surprising fact: there are, and always have been, way more words in the lexicon than in dictionaries. They explained to us that dictionaries have trouble finding low-frequency words, estimating the cut-off frequency at about one use for every billion words. If a word's presence in the English language is less than 1 part per billion, then dictionaries probably won't pick up on it. They estimate that a whopping 63% of all the different words found in the cultural genome fall below that cut-off.

Although a segment of these words would find their way into dictionaries, Michel and Aiden estimate 52% of all words used in English books over the last 500 years are "lexical dark matter" that goes undocumented in dictionaries and other references. Their paper provides some examples of this dark matter:

Part of this gap is because dictionaries often exclude proper nouns and compound words ("whalewatching"). Even accounting for these factors, we found many undocumented words, such as "aridification" (the process by which a geographic region becomes dry), "slenthem" (a musical instrument), and, appropriately, the word "deletable."

The rise and fall of celebrity

But the cultural genome doesn't just tell us the story of words - it can tell the story of people. By looking for which names show up most often in the genome, they are able to track the rise and fall of celebrity. They divided people into "classes" based on the years in which they were born, and then tracked when mentions of them in the lexicon reached a crucial tipping point.

In what is likely not a surprise to anyone, the average age of fame is getting younger and younger. Celebrities born in 1800 did not, on average, become famous until they were 43, compared to 29 for people born in 1950. But they explained to us that the arc of fame has also accelerated, and people sink back into obscurity much, much faster now than they did before.

The "doubling time" of celebrity, in which people reach twice the level of their initial fame, has shrunk from 8.1 years in 1800 to 3.3 years in 1950, but the half-life of celebrity is shrinking rapidly as well. People from the early 19th century enjoyed 120 years of continued lexical fame once they achieved celebrity status, whereas people from the late 19th century got only 71 years.

Of course, part of that is because we have a far wider definition of what constitutes celebrity than people did 200 years ago. Actors tend to become famous around the age of 30, writers become famous around 40, and politicians often have to wait until they're 50. In 2010, we've got far more celebrity actors than politicians, particularly compared to the early nineteenth century before the rise of mass media. For what it's worth, politicians tend to have the last laugh, as the most famous leaders achieve far greater and more lasting fame than their acting counterparts.

But it's not just people that we forget - it's the past itself. Let's consider the half-life of fame for actual years. References to the year 1880 didn't reach half their initial frequency until 32 years later, in 1912. 1973, on the other hand, had already reached its half-life by 1983, only ten years later. And that trend is only likely to increase - the more we have to say about ourselves in the present, the less time there is to consider the past.
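The half-life of a year, as described above, could be measured along the following lines. This is a rough sketch rather than the researchers' method: the function is hypothetical, and the sample numbers are made up purely to reproduce the 32-year figure for 1880.

```python
def half_life(series, start_year):
    """Years until mentions first fall to half their start_year level.

    series maps a publication year to how often the target year is
    mentioned in books from that year. Returns None if the level never
    halves within the data.
    """
    initial = series[start_year]
    for year in sorted(y for y in series if y > start_year):
        if series[year] <= initial / 2:
            return year - start_year
    return None


# Made-up illustrative values, NOT real corpus data.
mentions_of_1880 = {1880: 100.0, 1890: 80.0, 1900: 65.0, 1912: 50.0, 1920: 40.0}
print(half_life(mentions_of_1880, 1880))
```

Running the same function on mentions of 1973 would, by the figures above, return a much shorter span of about ten years.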

How censors erase people from history

Not all of these changes are unconscious, either. The cultural genome can reveal a lot about censorship and how people are written out of history. In one particularly striking example, the Jewish artist Marc Chagall saw his fame in the English lexicon increase fivefold between 1936 and 1944. As for Nazi Germany? In all those eight years, there is one reference to Chagall in the entire available German lexicon, an astonishing display of how power over language can write people out of history.

Michel and Aiden told us that certain groups of people get excised far more often than others. For instance, the Nazis blacklisted many different groups of people they judged to be dissidents, among them historians, political scientists, philosophers, and artists. Interestingly, blacklisted historians only see a very modest decline in their status in the German lexicon, whereas philosophers and artists almost drop out entirely, suggesting the Nazis felt far more threatened by free-thinking and creativity than by the past.

Nazis weren't the only group who controlled the lexicon. Leon Trotsky is almost completely suppressed from the Russian lexicon, while Tiananmen Square goes nearly unmentioned in the Chinese lexicon. The United States isn't innocent either - the "Hollywood Ten", a group of entertainers accused of communist sympathies in 1947, also disappear from the English lexicon, despite any additional infamy the accusations might have brought.

The cultural genome can also pick up on little things we might not otherwise think about. For instance, "Freud" is far more embedded in our collective subconscious than "Galileo", "Darwin", or "Einstein", at least if mentions in the lexicon are anything to go by. I asked Michel and Aiden about this, and they explained that it's not really a reflection on how people view their scientific work.

Rather, it shows how Freud has entered our everyday lexicon in the form of "Freudian slips" and other examples of passing pop psychology. Until people can so seamlessly integrate evolution or relativity into their everyday experience, Freud is likely to maintain his advantage.

From "save the country" to "save the world"

I asked them about other findings that surprised or intrigued them while going through the research. In explaining the sheer scope of what they're able to explore, they pointed to one intriguing lexical shift that suggests maybe humanity really did learn a lesson from World War II:

"We were just amazed at the extent to which huge numbers of concepts, phrases, expressions, that people say repeatedly. You can easily study the dynamics of "save the world" vs. "save the country." You find that since World War II people have shifted from "save the country" to "save the world", and we were shocked to see how much the dynamic had shifted. Statistics can predict a shift like that, but we can look at it in really fine-grained detail."

So what's next for the cultural genome? Well, we live in a digital age, so it's only going to grow more and more rapidly over the next 50 and 100 years as new books are written and older books get digitized for the first time. In the meantime, you can play around with their database and go searching for different n-grams yourself in Google Labs. (For the sake of science, please do one serious search before you go looking for dirty words.) It's all available at culturomics.org, and you can read their entire paper over at Science.

Image via Abundance Tapestry.