Text Analysis with Term Frequency-Inverse Document Frequency
INTRODUCTION
In this section, text analysis algorithms implemented in Python will be discussed. To illustrate these concepts, a collection of raw documents will be converted to a matrix of TF-IDF features.
Given a collection of textual documents, or a corpus, it is often useful to determine which terms are important, or key, for understanding the content of the text. One post-tokenization statistical approach to this task is simply to count the number of times a token appears in a specific document. The problem with this approach is that tokens that occur frequently throughout a corpus have an exaggerated effect on the overall analysis, even though they provide less information than tokens that occur less frequently. Therefore, other metrics are needed that are less susceptible to overemphasizing tokens that occur with high frequency in a training corpus.
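The limitation of raw counts can be seen in a minimal sketch. The token list below is a hypothetical, already-tokenized document made up for illustration; it is not part of the worked example later in this section.

```python
from collections import Counter

# Hypothetical, already-tokenized document (lowercased, punctuation removed).
tokens = ["the", "empire", "ruled", "the", "sea", "and", "the", "region"]

# Raw frequency: how many times each token appears in this one document.
raw_counts = Counter(tokens)

# The most frequent token is a function word that says little about content.
most_common_token, most_common_count = raw_counts.most_common(1)[0]
print(most_common_token, most_common_count)
```

In this toy case the highest raw count belongs to "the", which is exactly the kind of token that a weighting scheme such as TF-IDF is meant to de-emphasize.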
TERM-FREQUENCY AND TERM FREQUENCY-INVERSE DOCUMENT-FREQUENCY
First, some concepts need to be discussed. Term frequency (TF) is a commonly used metric in text analysis, as is term frequency-inverse document frequency (TF-IDF). These metrics are weights that are assigned to terms in a document. They are frequently used for information retrieval and document classification. Term-frequency is a straightforward statistical measure, whereas term frequency-inverse document frequency is a measure of information, where terms that are over-represented in a corpus carry less information content than those that occur less frequently.
There are several sophisticated definitions of TF and TF-IDF, but the concepts can be illustrated with the simplest formulation. Term frequency is the number of times a term occurs in a document relative to the total number of terms in the document, expressed as a fraction ranging from 0 to 1. It is a function of the term and the document in which it is found, and is calculated as follows:
[latex]tf(t, d) = f_{t, d} / N_{d}[/latex]
Here, t denotes the term, d denotes the document, ft,d is the number of times term t occurs in document d, and Nd is the total number of terms in document d.
The next metric is inverse document frequency, or IDF, which measures how rare the term is across the corpus: the fewer documents contain the term, the larger its IDF. IDF is a function of the term t and the entire corpus of documents, denoted as D. If ND = |D| is the number of documents in the corpus, and Nt,D denotes the number of documents in D that contain the term t (mathematically: Nt,D = |{d ∈ D : t ∈ d}|), then IDF is calculated as follows:
[latex]idf(t, D) = \log(N_{D} / N_{t, D})[/latex]
The TF-IDF metric is a function of the term t, the document d, and the entire set of documents D. It is calculated as follows:
[latex]tfidf(t, d, D) = tf(t, d) \times idf(t, D)[/latex]
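The three definitions above can be sketched directly in plain Python. The toy corpus and the function names tf, idf, and tfidf are illustrative assumptions; the sketch also assumes that every queried term occurs in at least one document, since otherwise the IDF would divide by zero.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["empire", "sea", "empire", "region"],
    ["sea", "culture"],
    ["empire", "language", "culture", "culture"],
]

def tf(term, doc):
    """Occurrences of term in doc divided by the total number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(ND / Nt,D); assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

print(tf("empire", corpus[0]))             # 2 of 4 tokens
print(idf("empire", corpus))               # log(3 / 2)
print(tfidf("empire", corpus[0], corpus))
```

A term that is absent from a document has a TF of 0 there, so its TF-IDF in that document is 0 regardless of its IDF.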
PYTHON EXAMPLE
A simple example worked in Python is used to illustrate these concepts. Suppose that a corpus contains four documents. That is, ND = |D| = 4. Further suppose that there are five terms of interest, t0, t1, t2, t3, and t4 (remember that in Python, index numbering starts at 0).
The five terms can be represented in a standard Python list data structure:
terms = ['empire', 'region', 'sea', 'culture', 'language']
The number of terms can be determined from the terms variable. The number of documents in the corpus is set to 4.
## NumPy is used for all array operations below....
import numpy as np
## Number of terms....
Nterms = len(terms)
## Number of documents in the corpus....
ND = 4
Using the nomenclature above, the total number of terms in each document, Nd, can be represented in a list (Note: these numbers are arbitrary in this example).
## Total number of terms in each document....
Nd = [373, 390, 171, 164]
In this example, assume that the number of times each of these terms occurs in each document has been determined. The resulting counts are represented in the 2D Numpy array terms_count.
terms_count
array([[ 5., 10., 13., 15., 8.],
[ 3., 1., 0., 5., 4.],
[12., 9., 3., 0., 0.],
[ 4., 0., 8., 4., 3.]])
For instance, in document 2 (starting numbering at 0), the term “region” occurs 9 times. In document 3, the term “language” occurs 3 times.
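The text assumes that terms_count already exists. A minimal sketch of how it might be constructed by hand, together with a lookup by term name, is given below; the helper count_of is a hypothetical convenience function and not part of the original example.

```python
import numpy as np

terms = ['empire', 'region', 'sea', 'culture', 'language']

# Rows are documents 0-3; columns follow the order of `terms`.
terms_count = np.array([[ 5., 10., 13., 15.,  8.],
                        [ 3.,  1.,  0.,  5.,  4.],
                        [12.,  9.,  3.,  0.,  0.],
                        [ 4.,  0.,  8.,  4.,  3.]])

def count_of(term, doc_index):
    """Hypothetical helper: look up a count by term name instead of column number."""
    return terms_count[doc_index, terms.index(term)]

print(count_of('region', 2))    # 'region' in document 2
print(count_of('language', 3))  # 'language' in document 3
```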
A Numpy 2D array is now allocated for TF and initialized to zeros. The array has ND rows and Nterms columns. The result is displayed.
tf = np.zeros((ND, Nterms))
tf
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
The elements of the tf array can be calculated in a loop using vectorized operations provided by Numpy.
for iD in range(0, ND):
    tf[iD] = terms_count[iD] / Nd[iD]
tf
array([[0.01340483, 0.02680965, 0.03485255, 0.04021448, 0.02144772],
[0.00769231, 0.0025641 , 0. , 0.01282051, 0.01025641],
[0.07017544, 0.05263158, 0.01754386, 0. , 0. ],
[0.02439024, 0. , 0.04878049, 0.02439024, 0.01829268]])
In the code above, the loop index iD is the index of the document. For document iD, all term counts are divided by the total number of terms in document iD. The operation is vectorized: terms_count[iD] is the iD-th row of terms_count and consists of 5 values, the number of times each of the 5 terms occurred in document iD. The expression terms_count[iD] / Nd[iD] divides all 5 term counts in document iD by the total number of terms in that document in one vectorized operation. The resulting five values are assigned to the iD-th row of the tf matrix.
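The row-by-row loop can also be written without an explicit loop by using Numpy broadcasting. The sketch below is one possible alternative, not the method used in this section; it assumes that Nd is converted from a list to a Numpy array, and the names tf_loop and tf_broadcast are introduced only for the comparison.

```python
import numpy as np

terms_count = np.array([[ 5., 10., 13., 15.,  8.],
                        [ 3.,  1.,  0.,  5.,  4.],
                        [12.,  9.,  3.,  0.,  0.],
                        [ 4.,  0.,  8.,  4.,  3.]])
Nd = np.array([373, 390, 171, 164])  # assumption: an array rather than a list

# Loop version, as in the text: one row (document) at a time.
tf_loop = np.zeros(terms_count.shape)
for iD in range(terms_count.shape[0]):
    tf_loop[iD] = terms_count[iD] / Nd[iD]

# Broadcasting version: Nd[:, np.newaxis] has shape (4, 1), so every row
# of terms_count is divided by the length of its own document.
tf_broadcast = terms_count / Nd[:, np.newaxis]

print(np.allclose(tf_loop, tf_broadcast))
```

Both forms give the same matrix; the broadcasting form is shorter, while the loop makes the per-document division explicit.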
The next step is to calculate IDF. For each term, the number of documents that contain the term must be determined. All that must be done is to determine whether each entry in the tf array is greater than 0. A value of 0 in row iD, column iterm indicates that term iterm was not found in document iD. A Boolean comparison can then be used:
tf > 0
array([[ True, True, True, True, True],
[ True, True, False, True, True],
[ True, True, True, False, False],
[ True, False, True, True, True]])
That is, entering tf > 0 at the Python command prompt results in Boolean values indicating the presence of a specific term (the column) in a specific document (the row). Then, the number of True values is counted across all documents for each term. The sum operation can be used for this purpose.
sum(tf > 0)
array([4, 3, 3, 3, 3])
This operation determines that the first term occurs in all four documents, while the second, third, fourth, and fifth terms each occur in three documents. The IDF metric can then be computed with a single vector operation.
idf = np.log(ND / sum(tf > 0))
Displaying the idf gives the following result:
idf
array([0. , 0.28768207, 0.28768207, 0.28768207, 0.28768207])
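The document counts can also be obtained with an explicit axis argument, which makes the direction of the sum unambiguous. The sketch below is an alternative formulation under the assumption that counting nonzero entries of terms_count is equivalent to testing tf > 0 (true here, because a term has nonzero TF exactly when its count is nonzero); the name doc_freq is introduced for illustration.

```python
import numpy as np

terms_count = np.array([[ 5., 10., 13., 15.,  8.],
                        [ 3.,  1.,  0.,  5.,  4.],
                        [12.,  9.,  3.,  0.,  0.],
                        [ 4.,  0.,  8.,  4.,  3.]])
ND = terms_count.shape[0]  # number of documents (rows)

# Document frequency: number of documents (rows) in which each term (column) occurs.
doc_freq = np.count_nonzero(terms_count, axis=0)

# IDF as defined above; a term found in no document would cause a division by zero.
idf = np.log(ND / doc_freq)

print(doc_freq)
print(idf)
```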
Finally, TF-IDF can be computed in a single loop using vectorized Numpy operations. First, a 2D array with ND rows and Nterms columns is allocated. Each row of the TF-IDF matrix is computed by multiplying the corresponding row of TF with IDF, as indicated in the equation shown above:
[latex]tfidf(t, d, D) = tf(t, d) \times idf(t, D)[/latex]
This equation is implemented as follows:
tfidf = np.zeros((ND, Nterms))
for i in range(0, ND):
    tfidf[i] = tf[i] * idf
The result is shown below.
tfidf
array([[0. , 0.00771266, 0.01002645, 0.01156898, 0.00617012],
[0. , 0.00073765, 0. , 0.00368823, 0.00295059],
[0. , 0.01514116, 0.00504705, 0. , 0. ],
[0. , 0. , 0.01403327, 0.00701664, 0.00526248]])
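For reference, the whole calculation can be condensed into a few lines using broadcasting. This is a sketch of an equivalent formulation rather than the loop-based code used above; the variable top_terms is a hypothetical addition that reports the highest-weighted term in each document.

```python
import numpy as np

terms = ['empire', 'region', 'sea', 'culture', 'language']
terms_count = np.array([[ 5., 10., 13., 15.,  8.],
                        [ 3.,  1.,  0.,  5.,  4.],
                        [12.,  9.,  3.,  0.,  0.],
                        [ 4.,  0.,  8.,  4.,  3.]])
Nd = np.array([373, 390, 171, 164])

# TF: each row divided by the length of its document.
tf = terms_count / Nd[:, np.newaxis]

# IDF: log of (number of documents / number of documents containing the term).
idf = np.log(terms_count.shape[0] / np.count_nonzero(terms_count, axis=0))

# idf has shape (5,), so it is broadcast across every row of tf.
tfidf = tf * idf

# Highest-weighted term in each document.
top_terms = [terms[j] for j in np.argmax(tfidf, axis=1)]
print(top_terms)
```

Because "empire" occurs in every document, its IDF is 0 and its entire column of the TF-IDF matrix is 0, even though it is the most frequent term in document 2.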
The reader will observe that TF-IDF is computed as the product of two seemingly conflicting values. The term frequency metric, TF, increases as the number of times the term is used in a document increases relative to the number of terms in that document, while the inverse document frequency, IDF, as the name implies, decreases as the number of documents containing the term increases. TF-IDF is often used instead of TF to reduce the effect of terms that occur very often across a corpus of documents and therefore provide less information, especially when training a classifier. Terms that occur in a small fraction of the corpus provide more information specifically because of their rarity.
The Scikit-learn library in Python provides several functions for text statistics. See the Scikit-learn documentation for a complete description of these functions.
The computation of some of the metrics differs slightly from the definitions given above. TF-IDF is still computed as the product of TF and IDF. However, IDF is computed as idf(t, D) = log(ND / Nt,D) + 1 when smoothing is disabled (smooth_idf=False); with the default setting, smooth_idf=True, it is computed as idf(t, D) = log([1 + ND] / [1 + Nt,D]) + 1. Another common definition is: idf(t, D) = log(ND / [Nt,D + 1]). Adding 1 to the IDF ensures that terms with an IDF of zero, that is, terms that occur in every document, are not entirely ignored in further analyses.
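The effect of these variants can be compared on the document counts from the example above. The sketch below simply evaluates each formula with Numpy; it does not call Scikit-learn, and the variable names are chosen for illustration.

```python
import numpy as np

ND = 4                                  # number of documents in the example
doc_freq = np.array([4, 3, 3, 3, 3])    # documents containing each term

# Definition used in this section: log(ND / Nt,D).
idf_basic = np.log(ND / doc_freq)

# Variant with 1 added to the result: log(ND / Nt,D) + 1.
idf_plus_one = np.log(ND / doc_freq) + 1

# Variant with 1 added to the denominator: log(ND / [Nt,D + 1]).
idf_denominator = np.log(ND / (doc_freq + 1))

print(idf_basic)
print(idf_plus_one)
print(idf_denominator)
```

For the first term, which occurs in all four documents, the basic definition gives 0, the "+ 1" variant gives 1, and the denominator variant gives a negative value, log(4 / 5). The choice of variant therefore changes how terms that appear everywhere are treated.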
The TF-IDF metric is very useful for clustering documents, as it provides a numerical representation of text to which subsequent computation can be applied. For instance, texts can be clustered on the basis of TF-IDF vectors. As was the case above, the text first needs to be converted into vectors. This step can be performed using the definitions given above, or the Scikit-learn library implementation of TF-IDF vectorization can be used.
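A minimal sketch of the Scikit-learn route is shown below. The three short documents are invented for illustration, and the vectorizer is used with its default settings, which, in addition to the IDF differences noted above, drop single-character tokens and scale each document vector to unit length. The resulting values are therefore not directly comparable with the hand-computed matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw documents, invented for this sketch.
docs = [
    "the empire controlled the sea",
    "the region shares a language and a culture",
    "the culture of the empire spread across the region",
]

vectorizer = TfidfVectorizer()        # default tokenization, smoothing, and L2 norm
X = vectorizer.fit_transform(docs)    # sparse matrix: one row per document

# The learned vocabulary maps each term to its column index.
vocabulary = sorted(vectorizer.vocabulary_)
print(X.shape)
print(vocabulary)

# With the default norm, each row has unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```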
If the TF-IDF matrix is too large, principal component analysis (PCA) can be used for dimension reduction. Finally, clustering is performed on the matrix. The k-means algorithm can be used for this purpose. A variety of preprocessing steps can be applied to improve the performance of document clustering (Beumer, 2020).
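The clustering workflow described here can be sketched with Scikit-learn as follows. The four documents, the choice of two principal components, and the choice of two clusters are all assumptions made for illustration; a real corpus would need these parameters to be chosen with care.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw documents, invented for this sketch.
docs = [
    "the empire controlled the sea and the coast",
    "ships of the empire crossed the sea",
    "the region shares a language and a culture",
    "language and culture vary across the region",
]

# Step 1: convert the text to TF-IDF vectors (dense, because PCA expects a dense array).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Step 2: reduce the dimension with PCA (2 components is an arbitrary choice).
X_reduced = PCA(n_components=2).fit_transform(X)

# Step 3: cluster the reduced vectors with k-means (2 clusters is an assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

print(labels)
```

Each entry of labels is the cluster index assigned to the corresponding document; the numbering of the clusters themselves is arbitrary.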