Analysis Of Culture Through Text Analysis
Culturomics is defined as “the application of high-throughput data collection and analysis to the study of human culture” (Michel et al., 2011) (7). The resources used in this analysis are books, newspapers, manuscripts, maps, works of art, and other cultural artifacts (Michel et al., 2011). Culturomic analyses involve millions of books simultaneously, analogous to distant reading in literary studies.
One of the first projects to address cultural questions through large-scale analysis of printed material was the construction of a corpus of over five million books for the quantitative analysis of cultural trends. The work was performed by a collaborative team of researchers from Harvard University, Google, and elsewhere. The corpus contains over 500 billion words in seven languages: English, German, Hebrew, Chinese, French, Spanish, and Russian.
This analysis was enabled by computational methods related to distant reading, in which algorithms are applied to a large body of material. The pre-processing, processing, and analysis steps were performed computationally because of the scale of the tasks and the volume of textual material. The pre-processing consisted of the following cleaning steps:
- Digitized books were obtained from the Google Books project, started in 2004, in which the collections of over 40 large libraries, as well as books provided directly by publishers, have been digitized. As of 2011, 15 million volumes had been digitized.
- The publication dates for the materials were corrected as needed.
- Volumes were selected based on an OCR (optical character recognition) quality score of the materials.
- Language detection, correction, and filtering were applied.
- The selection and filtering process continued through assessment of metadata fields.
The result of the pre-processing is a cleaned source corpus. Processing consisted of several standard text analysis steps:
- Tokenization.
- N-gram construction.
- Statistical counting and aggregation by date of publication.
The result of the processing is the corpus of historical n-grams. A 1-gram is defined as a string of characters uninterrupted by a space or other punctuation. 1-grams include single words (including misspelled ones), abbreviations, and numbers. A sequence of 1-grams (or unigrams) comprises an n-gram, where n denotes the number of 1-grams. Thus, for instance, “barometric pressure” is a 2-gram (n = 2), known as a bigram or digram, and the phrase “sleight of hand” is a 3-gram (n = 3), also called a trigram. For larger n, the n-gram is usually denoted by n, followed by the suffix “gram”, such as four-gram or ten-gram. In the research described here, n was restricted to at most 5, and only n-grams occurring at least 40 times in the corpus were considered. Usage frequency is a per-year statistic. It is the number of occurrences of an n-gram in a particular year divided by the total number of words in the corpus for that year.
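The processing steps and the usage-frequency definition above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the regular-expression tokenizer (which lowercases text and splits on anything that is not a letter, digit, or apostrophe), and the toy input are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into 1-grams (simplified, assumed rule: lowercase
    runs of letters, digits, and apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive 1-grams, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_by_year(books, max_n=5):
    """books is an iterable of (publication_year, text) pairs.
    Returns per-year n-gram counts and per-year total word counts."""
    counts = defaultdict(Counter)
    totals = Counter()
    for year, text in books:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        for n in range(1, max_n + 1):
            counts[year].update(ngrams(tokens, n))
    return counts, totals

def usage_frequency(counts, totals, ngram, min_count=40):
    """Per-year usage frequency: occurrences of the n-gram in a year
    divided by the total number of words in that year. N-grams seen
    fewer than min_count times in the whole corpus are dropped."""
    corpus_count = sum(c[ngram] for c in counts.values())
    if corpus_count < min_count:
        return {}
    return {year: counts[year][ngram] / totals[year]
            for year in sorted(totals) if totals[year] > 0}

# Toy corpus (invented text, for illustration only).
books = [
    (1900, "Barometric pressure fell and barometric pressure rose."),
    (1901, "The pressure rose."),
]
counts, totals = count_by_year(books)
freq = usage_frequency(counts, totals, "barometric pressure", min_count=1)
```

In this toy corpus the 2-gram “barometric pressure” occurs twice among the seven words dated 1900 and not at all in 1901. With the default threshold of 40 occurrences it would be excluded, which mirrors the filtering rule described above.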
Results of the Research
Subsequent analysis was performed on the resulting n-grams. One useful statistic was the mean percentage of overall frequency for words and phrases by year, based on different criteria. The authors use the example of the word “treaty”, and plot its mean percentage of overall frequency against time (in years) relative to treaty signature. Heads of state were analyzed in the same way, relative to their accession to power. Even changes in the names of countries were analyzed.
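One way such an event-aligned average could be computed is sketched below. This is an illustrative assumption rather than the authors' procedure: each term's yearly frequencies are shifted so that year 0 is its reference event (for example, a treaty's signature), and the values are then averaged across terms at each offset. The function name, the `window` parameter, and the numbers are hypothetical.

```python
from collections import defaultdict

def event_aligned_mean(series, event_year, window=2):
    """series maps a term to {year: frequency}; event_year maps the
    same term to the year of its reference event. Returns the mean
    frequency at each offset (year minus event year) within
    +/- window years, averaged over the terms that have data there."""
    sums = defaultdict(float)
    n_terms = defaultdict(int)
    for term, freq_by_year in series.items():
        t0 = event_year[term]
        for year, freq in freq_by_year.items():
            offset = year - t0
            if -window <= offset <= window:
                sums[offset] += freq
                n_terms[offset] += 1
    return {off: sums[off] / n_terms[off] for off in sorted(sums)}

# Invented frequencies for two hypothetical treaties.
series = {
    "treaty A": {1900: 1.0, 1901: 3.0},
    "treaty B": {1950: 2.0, 1951: 5.0},
}
event_year = {"treaty A": 1900, "treaty B": 1950}
aligned = event_aligned_mean(series, event_year)
```

Aligning on the event rather than on the calendar year lets terms from different decades be compared on a common time axis.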
Other statistics included the change in the number of words used over time and comparisons of forms of words (e.g. “found” vs. “finded”, “dwelt” vs. “dwelled”, “throve/thriven” vs. “thrived”). Because the corpus is in digital form, this analysis could also be performed on different query criteria, such as country (e.g. comparison of “spilt” vs. “spilled” in the U.S. and the U.K.). Straightforward statistics such as word frequency also reveal cultural trends. The authors illustrate the dynamics of the frequencies of the terms “Fort Sumter” (peaking during the U.S. Civil War), “Lusitania” (peaking in the World War I period), and “Pearl Harbor” (peaking in the World War II period). How people are referenced was also analyzed; for instance, the dynamic (time-varying) trends for the usage of “Henry David Thoreau”, “Henry Thoreau”, or “David Thoreau”.
The research demonstrated that cultural trends can be analyzed through careful investigation of textual media. Linguistic changes (changes in lexicon and grammar) and cultural phenomena (including how people and events are remembered) are both important areas of investigation, and both are treated in this study. The concepts discussed in textual media change over time, with certain words and phrases rising in popularity, then falling, and then possibly rising again in response to current events. Differences in the ways that a single idea or event is expressed at different times also reflect cultural change.
The authors provide an example with the phrase “the Great War” (World War I), which peaked between 1915 and 1941. Thereafter the event was referred to as “World War I”, although interest in it was high both before and after 1941. In addition, changes in the forms of certain words can be investigated. For example, the verb “chide” regularized between 1800 and 2000, with the past tense changing from “chid” and “chode” to “chided”. The verbs “burn”, “smell”, “spell”, “spill”, and “thrive” also regularized during this period, whereas “bend/bent”, “build/built”, “lend/lent”, and “send/sent” retained the generally obsolete phonological “-t” instead of “-ed” to indicate the past tense. Semantic factors contributed to regularization in the case of “speeded” and “speed up”, with the meaning of “sped” and “speeded” shifting from “to move rapidly” to “to exceed the legal limit”. Among other observations, approximately 1% of English speakers switch from “sneaked” to “snuck” every year, and the United States is the international leader in exporting both regular and irregular verbs.
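A shift between competing word forms can be summarized as the share of one form among all uses of either form in a given year. The sketch below is a hypothetical illustration, not the measure used by the authors; the function name and the frequencies are invented.

```python
def regular_share(regular, irregular):
    """regular and irregular each map year -> usage frequency of one
    past-tense form. Returns, per year, the fraction of all uses
    (regular + irregular) that take the regular form. Years in which
    neither form occurs are skipped."""
    shares = {}
    for year in sorted(set(regular) | set(irregular)):
        r = regular.get(year, 0.0)
        i = irregular.get(year, 0.0)
        if r + i > 0:
            shares[year] = r / (r + i)
    return shares

# Invented frequencies for a regular form and an irregular form.
regular_form = {1800: 1.0, 2000: 3.0}
irregular_form = {1800: 3.0, 2000: 1.0}
shares = regular_share(regular_form, irregular_form)
```

A share that rises toward 1 over time would indicate regularization, while a falling share would indicate a shift toward the irregular form, as reported for “snuck”.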
The popularity of people can also be examined in terms of their age cohort. For instance, the most famous people in the 1882 cohort include the English modernist writer “Virginia Woolf” and former U.S. Supreme Court justice “Felix Frankfurter”, whereas “Bill Clinton” and “Steven Spielberg” are prominent in the 1946 cohort.
Political suppression can also be detected from these publications. For example, the frequencies of mentions of the Jewish artist Marc Chagall in English and German books were compared. In both languages, mentions of Chagall increased rapidly in the late 1910s, and continued to rise in English-language publications after this time. However, Chagall’s popularity simultaneously decreased in German publications, reaching a low point in the period from 1936 to 1944. Other examples include mentions of Trotsky in Russia, Tiananmen Square in China, and the “Hollywood Ten” (blacklisted in 1947) in the United States.
Among some other interesting findings are the following (Michel et al., 2011):
- Epidemiology – The word “influenza” peaks at dates of known pandemics, suggesting the value of culturomic methods for historical epidemiology.
- Politics – Trends for the phrases “the North”, “the South”, and “the enemy” indicate the political polarization in the years immediately preceding the American Civil War.
- Scientists – “Freud” is more embedded into the collective consciousness than “Galileo”, “Darwin”, and “Einstein”.
- American diet – The words “steak”, “sausage”, “ice cream”, “hamburger”, “pizza”, “pasta”, and “sushi” are most common.
This research demonstrated the utility of quantitative techniques, specifically text analysis, for studying cultural trends. The work was enabled by applying computational techniques to digitized texts, and subsequently performing large-scale statistical analysis on the digitized data. Queries based on any research question can be rapidly performed through computational algorithms and statistical methods. Results can be quickly visualized, supplementing human cognition in detecting patterns and trends. The researchers summarize their contributions as follows: “Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence” (Michel et al., 2011) (7). They also illustrated culturomics concepts through examples of linguistic trajectories, and provided interpretations of these trends. However, the vast amount of data that was available at the time, which has grown enormously since, suggests that “…[m]any more fossils, with shapes no less intriguing, beckon” (Michel et al., 2011) (7).