Reading

The material on the following site is important and should be read either before or after studying the text in this section.

Clustering text documents using k-means

This Scikit Learn page presents an example of using Scikit Learn functions to cluster documents by topic with a bag-of-words approach. Rather than standard NumPy arrays, the features are stored in a sparse matrix representation provided by the Python SciPy package.
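As a rough illustration of why a sparse representation is used, the following minimal sketch builds a bag-of-words matrix for a tiny corpus. The three documents are made up for this example and are not taken from the linked page; CountVectorizer is used here only as one convenient way to obtain word counts.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# Bag-of-words: each row is a document, each column a vocabulary term,
# and each entry counts how often that term occurs in that document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(issparse(counts))  # True when the result is a SciPy sparse matrix
print(counts.shape)      # (number of documents, vocabulary size)
print(counts.nnz)        # number of explicitly stored non-zero entries

# A dense NumPy copy is reasonable only for small examples like this one.
dense = counts.toarray()
print(dense)
```

Most entries of a document-term matrix are zero, because each document contains only a small part of the vocabulary. A sparse matrix stores only the non-zero entries, which is what keeps larger corpora manageable in memory.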

The following websites may be used for reference.

TfidfVectorizer (Scikit Learn)

This Scikit Learn page describes converting a collection of raw documents to a matrix of TF-IDF features.

TfidfTransformer (Scikit Learn)

This Scikit Learn page describes transforming a count matrix into a normalized TF or TF-IDF representation.
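The sketch below illustrates the two-step route on a hypothetical corpus: raw counts are computed first and then reweighted. With default settings, the result should match what TfidfVectorizer produces directly, and passing use_idf=False gives a normalized TF representation instead.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# Step 1: raw term counts (the "count matrix").
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight the counts by inverse document frequency and normalize.
two_step = TfidfTransformer().fit_transform(counts)

# One-step alternative that starts from the raw text.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two results element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)

# Normalized term frequencies only, without the IDF weighting.
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts)
print(tf_only.toarray().round(2))
```

The separate transformer is useful when a count matrix already exists, for example when the counts are also needed for another analysis.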

The following material is optional.  However, interested readers are encouraged to peruse it.

Clustering Documents with TFIDF and KMeans

This website presents an example of clustering documents with TF-IDF features and the unsupervised k-means algorithm.
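The general idea can be sketched as follows. This is not the code from the linked website; the six documents, the choice of two clusters, and the parameter values are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough topics (animals and finance).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog played",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
    "markets rallied and stock prices rose",
]

# TF-IDF features, ignoring common English stop words.
features = TfidfVectorizer(stop_words="english").fit_transform(docs)

# k-means accepts the sparse matrix directly. The number of clusters
# (2) is an assumption chosen to suit this toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Print each document next to its assigned cluster label.
for label, doc in zip(labels, docs):
    print(label, doc)
```

The cluster labels themselves are arbitrary integers; only the grouping of documents is meaningful. On real corpora, the number of clusters usually has to be chosen by inspection or with a quality measure.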

Python Code

This section uses the Python code TF-IDF_Example.py (Jupyter Notebook TF-IDF_Example.ipynb).

License

Digital Humanities Tools and Techniques II Copyright © 2022 by Mark Wachowiak, Ph.D. is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.
