Reading
The material on the following site is important and should be read either before or after studying the text in this section.
Clustering text documents using k-means
This Scikit Learn page presents an example of using Scikit Learn functions to cluster documents by topic with a bag-of-words approach. Rather than dense NumPy arrays, the features are stored in a sparse matrix representation provided by the Python SciPy package.
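As a rough illustration of the sparse storage mentioned above, the following sketch builds bag-of-words counts for a small corpus. The three documents are hypothetical and are not taken from the Scikit Learn page; CountVectorizer is used here only as one convenient way to obtain a SciPy sparse matrix.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# fit_transform learns the vocabulary and returns the term counts
# as a SciPy sparse matrix (one row per document, one column per term).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(sparse.issparse(X))  # True
print(X.shape)             # (number of documents, vocabulary size)
print(X.nnz)               # number of non-zero counts actually stored

# Converting to a dense NumPy array is only sensible for small data.
dense = X.toarray()
```

Only the non-zero counts are stored, which matters because most documents contain a small fraction of the full vocabulary.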
The following websites may be used for reference.
TfidfVectorizer (Scikit Learn)
This Scikit Learn page describes converting a collection of raw documents to a matrix of TF-IDF features.
TfidfTransformer (Scikit Learn)
This Scikit Learn page describes transforming a count matrix into a normalized TF or TF-IDF representation.
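The relationship between the two classes above can be sketched as follows. With default settings, TfidfVectorizer is expected to give the same result as CountVectorizer followed by TfidfTransformer. The corpus is hypothetical and chosen only to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# One step: raw documents -> TF-IDF matrix.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw documents -> count matrix -> TF-IDF matrix.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Compare the two results element by element.
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(same)

# With the default norm="l2", each non-empty row has unit Euclidean length.
row_norms = np.linalg.norm(tfidf_direct.toarray(), axis=1)
print(row_norms)
```

The two-step form is useful when a count matrix already exists. The one-step form is more convenient when starting from raw text.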
The following material is optional. However, interested readers are encouraged to peruse it.
Clustering Documents with TFIDF and KMeans
This website presents an example of clustering documents using TF-IDF features and the unsupervised k-means clustering algorithm.
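A minimal sketch of this idea, assuming a hypothetical six-document corpus and two clusters, is shown below. It is not the code from the website above. The choices of n_clusters, n_init, random_state, and English stop-word removal are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough topics (pets and finance).
docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "a dog and a cat played",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
    "investors sold stocks as markets fell",
]

# Build sparse TF-IDF features; common English stop words are dropped.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# KMeans accepts the sparse matrix directly.
# n_clusters=2 is an assumption made for this toy corpus.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# One cluster label per document; the label numbers themselves are arbitrary.
print(labels)
```

Because k-means starts from random centers, the label assigned to each group can differ between runs unless random_state is fixed.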
Python Code
This section uses the Python script TF-IDF_Example.py (Jupyter Notebook: TF-IDF_Example.ipynb).