Topic Modeling and Basic Topic Modeling In R
INTRODUCTION
Topic modeling is an important, yet complex concept in text mining in the digital humanities, employing techniques from machine learning and natural language processing. A topic model is a statistical model that determines topics in a corpus or other document collection. The topics are not known a priori. These topics assist in the analysis of semantic structures that are embedded in text. Topic modeling, like k-means clustering, is an unsupervised machine learning technique. It is based on the premise that certain words or terms are expected to frequently appear in documents about a particular topic. For instance, “rain”, “snow”, and “wind” would be expected to occur frequently in documents related to “weather”, whereas “onion”, “kale”, and “celery” would often be found in documents about “vegetables”. Common words, such as conjunctions, prepositions, etc., and words not directly related to the topic (e.g. “happen”, “great”, “thought”, etc.) may occur with equal frequency in both sets of documents. This underlying idea of topic modeling is expressed mathematically by computing statistical measures on the words in the documents, which ultimately enables topics to be discovered, and the distribution of the topics throughout the documents analyzed. Technically, as a statistical modeling technique based on machine learning, topic modeling generates clusters of similar words, and therefore the “topics” are abstract. In this sense, the topics are analogous to the abstract clusters produced by the k-means algorithm.
Consequently, topic modeling allows scholars to infer the “latent structure” of a corpus or other document collection. It is most applicable to large volumes of documents, although scholars can also infer this structure from a smaller collection of documents, such as those of a single writer (Underwood, 2012).
According to digital humanities and literary scholar Ted Underwood: “Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them” (Underwood, 2012). It is a very popular method amongst historians (although digital mapping is the major focus in digital history research), and has been applied to text mining projects, such as the “Mining the Dispatch” project, which analyzed topics from the Richmond Daily Dispatch newspaper during the U.S. Civil War. Topic modeling is now a major topic for literary scholarship as well. However, because they pursue different questions, historians and literary scholars will likely use topic modeling in different ways (Underwood, 2012).
Topic modeling is also extremely widespread in literary studies, including the analysis of literary genre. Along with text mining, it has been classified as the predominant practice in digital literary studies. For instance, it has been used to study genre in French drama from the classical and Enlightenment periods (Schöch, 2021).
Topic modeling is useful for determining the semantic structures of large collections of documents through the application of statistical algorithms. It is important not only in the digital humanities, but in all areas in which text is analyzed and organized, especially given the massive amount of unstructured text constantly generated from a multitude of sources, many of them Web-based. It is particularly useful for organizing large digital archives and for facilitating their analysis and querying. Topic modeling is also attracting attention in medical fields. For instance, it has been used for text mining biomedical corpora, in one case via a method based on inverse document frequency that applies a variation of k-means clustering to cluster the documents (Rashid et al., 2019). The algorithms used in topic modeling have also been adapted to other application areas, including classifying genomic sequences in bioinformatics. In this case, genomic sequences represent “documents”, and small DNA strings of a specified length, say k, known as k-mers, are considered as “words”. From the words in a given sequence, a “bag of words” model can be determined. The bag of words is an N × V matrix, where N denotes the number of genomic sequences (“documents”) and V denotes the number of k-mers (“words”) (Liu et al., 2016). The resulting bag of words model is subsequently input into a topic modeling algorithm to determine the DNA clusters without the sequence alignment that is typically performed to classify sequences (la Rosa et al., 2015), (Liu et al., 2016).
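To make the N × V bag-of-words construction concrete, the following short sketch (written in Python purely for illustration; the sequences and the value of k are invented) extracts the k-mers of each genomic sequence and tallies them into a matrix with one row per sequence (“document”) and one column per possible k-mer (“word”):

```python
from itertools import product

def kmer_bag_of_words(sequences, k):
    """Build an N x V bag-of-words matrix: one row per sequence
    ("document"), one column per possible k-mer ("word")."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]  # all V = 4^k k-mers
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for seq in sequences:
        counts = [0] * len(vocab)
        # Slide a window of length k over the sequence to extract its "words".
        for i in range(len(seq) - k + 1):
            counts[index[seq[i:i + k]]] += 1
        matrix.append(counts)
    return vocab, matrix

# Invented toy data: N = 2 sequences, k = 2, so V = 16 possible 2-mers.
vocab, bow = kmer_bag_of_words(["ACGTAC", "GGGG"], k=2)
```

A matrix such as `bow` is the kind of input that an alignment-free topic modeling pipeline would consume in place of a term-document matrix built from natural language text.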
A growing research area with a large number of applications is dynamic topic modeling, a subdiscipline within topic modeling in which the temporal dimension, time, is introduced into the analysis. This approach is used in semantic time-series analysis, document classification, and event detection. Because topic modeling provides a representation of the semantic structure of documents, a significant change in that structure may be indicative of a real-world event that is reflected in that change.
LATENT DIRICHLET ALLOCATION (LDA)
The most common topic model is Latent Dirichlet Allocation (LDA), a statistical model originally developed for population genetics, but establishing a foothold in natural language processing as a machine learning technique. The LDA algorithm is complex mathematically, and a full treatment of the method is beyond the scope of the current discussion. A simplified, straightforward explication is given by Underwood (Underwood, 2012). In essence, the algorithm begins by randomly assigning words to topics in an initial model. This initial model is iteratively refined to improve the internal consistency of these assignments, in that
- words that are common in topics will become more common in those topics as the model is refined, and
- topics that are common in documents will become more common.
The process iterates until the model stops changing – that is, until it reaches a consistent equilibrium (Underwood, 2012).
As stated above, each document in a corpus is treated as a “bag of words”, in which the ordering and grammatical roles of the words are unimportant to the model. Unimportant words (conjunctions, prepositions, etc.) are removed from further analysis, as are words that are used very frequently or are overrepresented, because such words have low informational content and are therefore not relevant for classification. As is the case with k-means clustering, the number of topics (k) must be pre-specified by the user. At each step of the algorithm, it is assumed that all topic assignments are correct with the exception of the current word being analyzed, whose assignment is updated through the model. For each topic, say Z (following the notation in (Underwood, 2012)), and for each word, say w, in document D, the probability that word w belongs to topic Z in document D is calculated as the proportion:
Probability that w in D belongs to Z ∝ ((number of times w occurs in Z + a) / (total number of words in Z + b)) × (number of words in D that belong to Z + c).
Here, a, b, and c are hyperparameters, or “fudge factors” (numbers that can be adjusted to improve algorithm performance), of the kind commonly needed in machine learning algorithms. The fact that probabilities are being computed underscores the designation of topic modeling as a statistical, or probabilistic, model. With these probabilities computed, each word is re-assigned to the topic yielding the highest probability value. This computation is then repeated for each word in the collection, re-assigning those words using the same technique. The entire process is then iterated (repeated) until equilibrium is reached, meaning that the two conditions described above are achieved. As in most probabilistic models, perfect internal consistency is an ideal that is never achieved in practice, because the distribution of words across documents does not follow a neat statistical ideal (Underwood, 2012).
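The proportion above can be evaluated numerically for a single word against a set of candidate topics. The following Python sketch (used for illustration only; all counts and hyperparameter values are invented) computes the unnormalized score for two topics, normalizes the scores into probabilities, and re-assigns the word to the likelier topic:

```python
def topic_probability(w_count_in_z, words_in_z, words_in_d_in_z, a, b, c):
    """Unnormalized probability that word w in document D belongs to topic Z,
    following the proportion in the text:
    (count of w in Z + a) / (total words in Z + b) * (words of D in Z + c)."""
    return (w_count_in_z + a) / (words_in_z + b) * (words_in_d_in_z + c)

# Invented counts for one word w across two candidate topics.
scores = [
    topic_probability(10, 200, 8, a=0.1, b=20.0, c=0.1),  # topic 1
    topic_probability(2, 300, 1, a=0.1, b=20.0, c=0.1),   # topic 2
]

# Normalize so the scores sum to 1, then re-assign w to the likelier topic.
total = sum(scores)
probs = [s / total for s in scores]
best_topic = probs.index(max(probs))
```

In a full Gibbs sampling run, this update would be applied to every word in every document, over and over, until the assignments stop changing appreciably.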
APPLICATION TO ANALYZING GENRE
Topic modeling is also useful for discovering abstract topics that are important and defining for a particular genre. An example of recent work in this area is topic modeling of the new genre of “Defining the Digital Humanities”. The digital humanities field has been described as suffering from a “definition addiction” (Callaway et al., 2020). As indicated in the first section of the first course in this certificate program, the literature on what digital humanities is/are (its “welcoming”, or “pull”, aspects) and is not (its “gate-keeping”, or “push”, aspects) is something of a genre unto itself. Definitions abound, some of them canonical, such as the widely cited 2010 definition and description of Matthew G. Kirschenbaum (Kirschenbaum, 2016). However, a recent paper has identified 334 such definitions (Callaway et al., 2020).
The researchers collected and curated a corpus of digital humanities definitions as text (.txt) files. All definitions were in the English language. Additionally, they included 15 different metadata fields recording identifying information, such as the department, career stage, and institution of the authors of each definition (Callaway et al., 2020). The researchers considered this corpus to provide broad coverage of digital humanities areas, and although they considered its size adequate, they acknowledged that it is relatively small. To investigate the question of what constitutes the digital humanities, the researchers selected k = 55 topics. In this research, a topic was designated as a group of words that tended to co-occur in the collection of 334 documents. The researchers at first perceived this number of topics to be arbitrary, but eventually concluded that k = 55 resulted in meaningful topics that neither overlapped nor were excessively narrow.
Topic modeling was performed with the mallet R package, which is an R wrapper for the Java machine learning tool MALLET. A wrapper is a function called by a user that subsequently calls, or invokes, another function. In this case, the researchers ran the topic modeling with an R function, and this function ran another function, written in the Java language, that provides the actual topic modeling functionality. Three subsequent analyses were performed. The first visualized word clouds (tag clouds) of the top 100 words in each topic. Another analysis investigated timeline graphs of the average presence of each topic through time, calculated as the ratio of the yearly sum of all occurrences of a topic (as found in the topic-document matrix) to the number of documents in each year. In a third analysis, the location of the institutional affiliation of the first author of each document at the time of publication was mapped. From this mapping, heat maps consisting of the colour-coded results provided an indication of the spatial presence of each topic. However, the most useful results were obtained by analyzing the metadata (Callaway et al., 2020).
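The timeline calculation described above (the yearly sum of a topic's presence in the topic-document matrix, divided by the number of documents published that year) can be sketched in a few lines; Python is used here for illustration, and the topic-document matrix and publication years are invented:

```python
from collections import defaultdict

def average_topic_presence(topic_doc_matrix, doc_years, topic):
    """Average presence of one topic per year: sum the topic's weight over
    all documents published that year, then divide by that year's count."""
    totals, counts = defaultdict(float), defaultdict(int)
    for doc_idx, year in enumerate(doc_years):
        totals[year] += topic_doc_matrix[topic][doc_idx]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}

# Invented topic-document matrix: 2 topics x 4 documents, with the
# publication year of each document.
matrix = [
    [0.6, 0.2, 0.8, 0.4],  # topic 0
    [0.4, 0.8, 0.2, 0.6],  # topic 1
]
years = [2010, 2010, 2011, 2011]
presence = average_topic_presence(matrix, years, topic=0)
```

Plotting `presence` by year for each topic yields the kind of timeline graph the researchers describe.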
From the results produced, the researchers analyzed four topics in detail. Recall that the topics determined by topic modeling are abstract: they are not associated with a pre-defined topic, analogous to clusters produced by the k-means algorithm that may or may not have intrinsic meaning for the question being researched. Some of the words included in the topics were: “job, field, scholars, students, programming”; “values, community, open, collaboration, openness”; “literary, reading, text, literature, criticism”; and “race, projects, gender, women, studies”. The topics represented by these words were respectively designated as “Code”, “Community”, “Distant Reading”, and “Diversity and Inclusion”.
The researchers arrived at some interesting conclusions. Because of the multitude of definitions, the “push”, or “gate-keeping”, mechanism of the digital humanities (barriers to entering or progressing in the field, such as a lack of programming knowledge) is offset. However, this same definitional variety can be disorienting to aspiring digital humanities scholars. The researchers detected possible gender trends in the topics classified as “Distant Reading” and “Diversity and Inclusion”, suggesting the potential influence of gender differences across topics. The authors also found that the definitions in the corpus contained gender and class imbalance, as evidenced by the large number of definitions written by male academics. In analyzing technical competencies, the authors drew a conclusion not directly related to topic modeling. Although they used R functions to run the sophisticated algorithms (LDA) needed for topic modeling, they reported that their most useful analyses were performed with spreadsheets and basic chart visualizations. They even emphasized that the latter were produced not with the powerful visualization capabilities of R, but with the simpler functionality of their spreadsheet software. While not dismissing the power and utility of topic modeling, they propose that the purported power and complexity of the approach need not be a barrier (a “push” mechanism) to those working in or entering the digital humanities. The researchers conclude that the task of defining the digital humanities is unfinished, and that this activity continues unabated (Callaway et al., 2020).
EXAMPLE OF TOPIC MODELING IN R
(Note: Please refer to the R script textAnalysis_110621.R, included in the distribution for this course.)
This example demonstrates topic modeling on some sections of text obtained from Wikipedia entries.
First, the necessary packages are loaded.
library(readtext)
library(topicmodels)
## Needed for the corpus() function....
library(quanteda)
The text is then read. The text is a very small corpus consisting of the first few paragraphs of Wikipedia articles about the digital humanities and about a technique in signal processing (wavelet analysis).
## Directory and input file name....
inputPath <- 'Input File Path'
inputFn <- 'subjectData.csv'
## Full file path is formed by concatenating the input path and input file name.
fn <- paste(inputPath, inputFn, sep = '')
## Read the text.
rt <- readtext(fn, text_field = "subjText")
## Get the document names, indicated by "subjects" in this example.
doc_names <- rt$subject
## Change the default document ID to the document names.
rt$doc_id <- doc_names
## Create 'quanteda' corpus.
fulltext <- corpus(rt)
## Recall that 'fulltext' is a corpus.
txts <- corpus_reshape(fulltext, to = "paragraphs")
The document-term matrix is then calculated from the text, removing punctuation and common English language stop words.
## Create a document-term matrix.
par_dtm <- dfm(txts, stem = TRUE, remove_punct = TRUE,
               remove = stopwords("english"))
Rare terms, in this case those with fewer than five occurrences, are removed.
## Remove rare terms.
par_dtm <- dfm_trim(par_dtm, min_count = 5)
To perform topic modeling, the modified document-term matrix is converted to a format that can be used internally by the LDA algorithm implemented in the topicmodels package.
## Convert to topicmodels format.
par_dtm <- convert(par_dtm, to = "topicmodels")
Initial assignments of words to topics are performed randomly. For reproducibility, the pseudo-random number generator is seeded (initialized) with a specific value so that the same sequence of random numbers is always produced.
## Seed the randomization for reproducibility.
set.seed(1)
The LDA algorithm in the topicmodels package is then run. In this example, the number of topics (k) is set to 5. The implementation of the LDA algorithm is encapsulated in the LDA() function.
## Generate the LDA model.
lda_model <- topicmodels::LDA(par_dtm, method = "Gibbs", k = 5)
The resulting topics are then displayed with the terms() function. The five most important terms are displayed in this example.
## Display the terms.
terms(lda_model, 5)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
[1,] "digit" "use" "analyt" "wavelet" "data"
[2,] "studi" "comput" "visual" "signal" "inform"
[3,] "human" "analysi" "analysi" "can" "text"
[4,] "softwar" "learn" "techniqu" "time" "process"
[5,] "cultur" "numer" "data" "transform" "scienc"
From analyzing words associated with the five topics, and with some domain knowledge of the text in the corpus, one possible designation of the topics is as follows.
Topic 1: digital humanities (from the tokens “digit”, “cultur”, and “softwar”)
Topic 2: computational analysis (from the tokens “comput”, “analysi”, and “numer”)
Topic 3: visual analytics (from the tokens “visual”, “analyt”, “techniqu”, and “data”)
Topic 4: wavelet analysis (from the tokens “wavelet”, “signal”, “time”, and “transform”)
Topic 5: data science (from the tokens “data”, “scienc”, “text”, and “inform”)
Note that there are alternative, equally valid designations for these topics.
The analysis can be easily re-run with k = 3 topics.
## Generate the LDA model.
lda_model <- topicmodels::LDA(par_dtm, method = "Gibbs", k = 3)
The five most important terms for the resulting three topics are again displayed with the terms() function.
## Display the terms.
terms(lda_model, 5)
Topic 1 Topic 2 Topic 3
[1,] "data" "visual" "use"
[2,] "analyt" "digit" "can"
[3,] "analysi" "comput" "wavelet"
[4,] "inform" "studi" "numer"
[5,] "scienc" "human" "algorithm"
From again analyzing the words associated with the three topics, and using domain knowledge of the text in the corpus, one possible classification for the topics could be:
Topic 1: data analytics (or data science) (from the tokens “data”, “analyt”, and “scienc”)
Topic 2: visualization (from the tokens “visual”, “comput”, and “digit”)
Topic 3: wavelet analysis (from the tokens “wavelet”, “algorithm”, and “numer”)
This simple example demonstrates that basic topic modeling can be performed with a few lines of code in R. However, it must be remembered that the algorithmic complexity underlying LDA is “hidden” from the user by the LDA() function, which implements this powerful approach.