Text Clustering Based on Tf-Idf Features
INTRODUCTION
In this section, further examples of text clustering, along with related text processing and analysis, are presented.
In previous sections, k-means clustering was performed on numerical values to illustrate the basic concepts. Text clustering was also illustrated through feature sets of a corpus of Greco-Roman texts, including known works, commentaries, modern editions, and manuscripts. First, three of the features were selected and clustered (i.e., 3D samples were clustered); subsequently, all four features were selected and clustered (i.e., 4D samples were clustered). In that example, the characteristics of the texts were already provided rather than derived from the texts themselves, which were not provided. In the current example, features will be calculated from the texts and then clustered. The clustering and analysis are performed in Python. The example that follows is based on the demonstration of sentence clustering in.
K-MEANS CLUSTERING ON TF-IDF FEATURES
Parts of documents were obtained from the first few paragraphs of Wikipedia articles on empires in the ancient world:
- the Seleucid Empire;
- the Pala Empire;
- the Neo-Assyrian Empire;
- the Khmer Empire;
- the Durrani Empire; and
- the Majapahit Empire.
To perform this task, the necessary functions are first imported from the Scikit Learn package.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
The documents can be read from text or into a data frame from a CSV file. However, in this example, the individual documents are short, and therefore the text can be directly incorporated into the Python code.
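For larger corpora, the documents could, for instance, be loaded from a CSV file with the pandas package. The following is a minimal sketch, assuming a hypothetical file corpus.csv with a column named text; the file and column names are illustrative only.
import pandas as pd
## Read a hypothetical CSV file of documents (the file and column names
## are assumptions for illustration; adjust to the actual corpus).
df = pd.read_csv('corpus.csv')
documents = df['text'].tolist()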
documents = [
"The Seleucid Empire was a Greek state in Western Asia, ... ",
"The Pala Empire was an imperial power ...",
"The Neo-Assyrian Empire was an Iron Age Mesopotamian empire ... ",
"The Khmer Empire are the terms that historians use to refer ... ",
"The Durrani Empire, also called the Sadozai Kingdom ... ",
"The Majapahit was a Javanese Hindu thalassocratic empire ... "
]
The TfidfVectorizer() function is used to convert the documents to a matrix of TF-IDF features.
## Initialize a Tfidf vectorizer to convert the documents to TF-IDF features.
## The English stop words supplied by Scikit Learn are used.
vectorizer = TfidfVectorizer(stop_words = 'english')
The next step is for the vectorizer to learn the vocabulary and inverse document frequency (IDF) of the documents, and to generate the document-term matrix.
## Generate the document-term matrix.
X = vectorizer.fit_transform(documents)
The document-term matrix, X, is a sparse matrix (i.e., it contains many zeros) in which the rows represent the documents and the columns represent the terms that were determined from the documents. Because the matrix is sparse, it is stored in a compressed format (a SciPy compressed sparse row matrix) rather than as a dense NumPy array. In the current example, computing the term frequency-inverse document frequency (TF-IDF) metric on the texts results in a 6 x 627 sparse matrix, where 6 indicates the number of documents (the texts about the six empires) and 627 indicates the number of terms that were determined.
>>> X
<6x627 sparse matrix of type '<class 'numpy.float64'>'
with 781 stored elements in Compressed Sparse Row format>
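The sparsity of the matrix can be confirmed directly. The following is a minimal sketch; nnz is the number of stored (non-zero) elements of a SciPy sparse matrix.
## Inspect the sparsity of the document-term matrix.
n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)
print('Shape: %d x %d' % (n_rows, n_cols))
print('Stored (non-zero) elements: %d' % X.nnz)
print('Density: %.3f' % density)
With the 6 x 627 matrix above, the 781 stored elements correspond to a density of roughly 0.21.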
The terms that were determined can be explored with the get_feature_names() function on the vectorizer object. (In scikit-learn 1.0 and later, this function is named get_feature_names_out(); get_feature_names() was removed in version 1.2.) The results are sorted in alphanumeric order. The first ten terms are related to numbers.
>>> vectorizer.get_feature_names()[0:10]
['10th', '11th', '1293', '12th', '1350', '1365', '1389', '13th', '1527', '15th']
For subsequent analysis, the terms can be stored as a variable.
## Get all the terms that were determined.
terms = vectorizer.get_feature_names()
The ten terms starting at index 540 can be displayed, for instance.
>>> terms[540:550]
['standard', 'starting', 'state', 'status', 'steady', 'stretching', 'strong', 'stronghold', 'struggled', 'studies']
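The column index of a particular term can also be looked up directly through the vectorizer's vocabulary_ attribute, a dictionary that maps terms to column indices. A minimal sketch, using the term "empire", which occurs in the example documents:
## Look up the column index of a specific term.
idx = vectorizer.vocabulary_['empire']
print(idx, terms[idx])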
Clustering can subsequently be performed on the values in the TF-IDF matrix, X in this example. Three clusters will be determined (k = 3). For faster convergence, the k-means++ scheme (denoted with the init = 'k-means++' option) is used to select the initial cluster centers. The maximum number of iterations is set to 1000, and the algorithm is run with a single initialization (n_init = 1).
## Number of clusters.
K = 3
## Run the k-means algorithm.
model = KMeans(n_clusters = K, init = 'k-means++', max_iter = 1000, n_init = 1)
model.fit(X)
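The cluster assigned to each document is available through the model's labels_ attribute; the order matches the documents list. (The exact assignment can vary between runs because of the random initialization.)
## Cluster label of each of the six documents.
print(model.labels_)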
The clustering can subsequently be examined. To determine the top N terms in each cluster, the following code can be run, illustrated with N = 5.
## User feedback.
print('Top terms per cluster:')
## Order the terms in each centroid from largest to smallest TF-IDF weight.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
## Display the top N terms in each of the clusters.
N = 5
for i in range(K):
    print('Cluster %d:' % i)
    for ind in order_centroids[i, :N]:
        print('  %s' % terms[ind])
    print()
The top five terms for each of the clusters in this example are shown below.
Top terms per cluster:
Cluster 0:
empire
bc
pala
assyrian
seleucid
Cluster 1:
majapahit
angkor
empire
indonesia
khmer
Cluster 2:
shah
ahmad
durrani
abdali
empire
With these clusters, the cluster to which a new sentence belongs can be predicted. For example, suppose that the cluster of documents to which the following phrase is most related is to be determined.
"Technology of Southeast Asian kingdoms"
The phrase can be transformed with the TF-IDF vectorizer described above.
Y = vectorizer.transform(['Technology of Southeast Asian kingdoms'])
This statement results in a sparse matrix with 1 row and 627 columns, i.e., with the same number of terms (627) as the TF-IDF matrix calculated above.
>>> type(Y)
<class 'scipy.sparse.csr.csr_matrix'>
>>> Y.shape
(1, 627)
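The non-zero entries of Y correspond to those words of the phrase that occur in the learned vocabulary (stop words and words unseen during fitting receive no weight). A minimal sketch of inspecting them:
## Terms of the phrase that occur in the vocabulary, with their TF-IDF weights.
for ind in Y.nonzero()[1]:
    print('%s: %.3f' % (terms[ind], Y[0, ind]))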
The cluster to which this phrase belongs can be determined with the predict() function.
prediction = model.predict(Y)
The prediction denotes the cluster to which the phrase was predicted to belong. In this case, this is Cluster 1.
>>> prediction
array([1])
By observing some of the top terms in Cluster 1 (e.g., "indonesia", "khmer"), it can be seen that the phrase containing "Southeast Asian" plausibly belongs to this cluster.
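The basis of the prediction can also be inspected: the model's transform() function returns the distance from a sample to each of the K cluster centers, and predict() selects the nearest center.
## Distances from the phrase to each of the K cluster centers.
distances = model.transform(Y)
print(distances)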
In another example, suppose that the cluster of documents to which the following phrase is most related is to be determined.
"Technology of Central Asian kingdoms"
The phrase can again be transformed with the TF-IDF vectorizer described above.
Y = vectorizer.transform(['Technology of Central Asian kingdoms'])
The cluster to which this phrase belongs is again determined with the predict() function.
prediction = model.predict(Y)
The prediction denotes the cluster to which the phrase was predicted to belong. In this case, this is Cluster 0.
>>> prediction
array([0])
By observing some of the top terms in Cluster 0 (e.g., "assyrian", "seleucid"), it can be seen that the phrase containing "Central Asian" plausibly belongs to this cluster.
CONCLUSION
This short example illustrates that TF-IDF features obtained from documents can be clustered for the purpose of predicting the clusters to which new, unseen phrases belong. As with most machine learning techniques, accuracy and performance improve with a larger number of documents, as well as with longer documents. In addition, the number of clusters (the k hyperparameter) must be selected carefully.
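One common heuristic for choosing k, not used in the example above, is the elbow method: k-means is run for a range of k values, and the within-cluster sum of squared distances (the model's inertia_ attribute) is examined for a bend beyond which additional clusters yield diminishing returns. A minimal sketch, reusing the matrix X from this example:
## Compute the k-means inertia for a range of cluster counts.
for k in range(1, 6):
    m = KMeans(n_clusters = k, init = 'k-means++', max_iter = 1000, n_init = 1)
    m.fit(X)
    print('k = %d, inertia = %.3f' % (k, m.inertia_))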