Components of Text Mining
In this phase, text documents are extracted from various types of external sources into a text index (for subsequent search) as well as a text corpus (for text mining). Document sources can be a public web site, an internal file system, etc.
For instance, one may need to search, examine, or download a predefined list of web sites. The text would then be parsed and subsequently converted into multiple documents that are stored in a text index and/or a text corpus. Likewise, text may be searched, monitored, and extracted from Twitter for a specific topic. This text may also be stored in a text index and corpus. Additionally, texts in different languages may require machine translation services.
To extract text from the Web, a technique known as web scraping is often employed. Web scraping is described in more detail in the section on Internet Studies. At this point, a concise explanation will suffice. It is often difficult to obtain data resources from the Internet and web services, and, consequently, humanities scholarship may not benefit from this vast data repository (Black, 2016). Web scraping is an approach that makes this data more accessible by using specialized software programs and scripts to extract specific, meaningful data from web pages. Programs used for this purpose include web crawlers, bots, and Web APIs. These programs navigate web pages to obtain target data directly from the HTML that constitutes the web page. This process is quite different from automatically obtaining screen captures or limited amounts of text from the page. Web sites to be “crawled” are pre-selected in a systematic manner. The result is a large collection of unstructured data that must subsequently be transformed into a semi-structured (e.g. TEI) or structured format so that it can be used (Zeng, 2017). Web scrapers are available in Python and R, the two programming languages most frequently used in digital humanities scholarship.
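The core of any scraper, extracting visible text from the HTML of a page, can be sketched with only the Python standard library. The sketch below parses a fixed HTML string rather than fetching a live page; the class name and sample markup are illustrative, not part of any particular scraping tool.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose content is not visible text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>Body text.</p></body></html>")
extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.parts)
print(text)  # → Title Body text.
```

A real crawler would add page fetching, link following, and politeness controls; libraries such as Beautiful Soup or Scrapy package these concerns.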
After the text of interest has been extracted, it must be transformed into a form conducive to further computational processing. Text transformation consists of several subtasks.
Text normalization is generally performed to facilitate and improve text transformation. Conversion of tokens to lowercase, expanding contractions, and the removal of punctuation, numerical values, and stop words are part of this process. Another optional normalization procedure is to convert words to synonymous forms. WordNet is a large database focused on semantic relationships between words in a large number of languages. These relationships include synonymy; hyponymy, a type-of relationship in which a hypernym denotes a supertype and a hyponym a subtype of that hypernym (e.g. “truck” is a hyponym of the hypernym “vehicle”); and meronymy, a part-of relationship between a meronym and its corresponding holonym (e.g. “microprocessor” is a meronym of the holonym “computer”). Additionally, domain-specific dictionaries can be used for texts in a targeted subject area (Wasilewska). Stemming, described below, is generally considered to be a normalization procedure.
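The basic normalization steps listed above can be sketched in a few lines of standard-library Python. The contraction and stop-word lists here are tiny illustrative subsets; a real pipeline would use much larger resources.

```python
import string

CONTRACTIONS = {"don't": "do not", "it's": "it is"}        # illustrative subset
STOP_WORDS = {"the", "a", "of", "is", "do", "not", "it"}   # illustrative subset

def normalize(text):
    # Lowercase, expand contractions, strip punctuation and digits, drop stop words
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return [t for t in text.split() if t not in STOP_WORDS]

print(normalize("It's the Best of Times, Don't You Think?"))
# → ['best', 'times', 'you', 'think']
```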
In tokenization, text documents are decomposed into individual elements, or tokens. Words are the most common tokens. Tokenization segments, or separates, the entire text into words, removing white space (blank spaces, tabs, etc.) and punctuation (commas, periods, etc.).
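A minimal word tokenizer of this kind can be written with a single regular expression; the pattern below is a common simplification, not a full tokenization scheme.

```python
import re

def tokenize(text):
    # \w+ keeps runs of letters and digits; white space and punctuation act as separators
    return re.findall(r"\w+", text)

print(tokenize("Tokens, please: one two-three."))
# → ['Tokens', 'please', 'one', 'two', 'three']
```

Note that this pattern splits hyphenated forms such as “two-three” into separate tokens; whether that is desirable is a design decision for the pipeline.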
Stop word removal: In this step, words that are generally not of interest, such as articles and prepositions, are removed. Stop word removal also involves removing HTML and XML tags that occur in text obtained from web pages. When the text being processed was obtained from the web, tag removal is normally performed first, followed by removal of other stop words.
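The two-stage order described above, tags first, then ordinary stop words, can be sketched as follows; the stop-word set is an illustrative subset.

```python
import re

STOP_WORDS = {"the", "and", "of", "in", "a"}  # illustrative subset

def strip_tags(html):
    # Stage 1: remove HTML/XML tags from web-derived text
    return re.sub(r"<[^>]+>", " ", html)

def remove_stop_words(tokens):
    # Stage 2: drop ordinary stop words
    return [t for t in tokens if t.lower() not in STOP_WORDS]

raw = "<p>The rise of text mining in the humanities</p>"
tokens = strip_tags(raw).split()
print(remove_stop_words(tokens))
# → ['rise', 'text', 'mining', 'humanities']
```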
Stemming is the process of identifying the root of a word, called the stem. In other words, it reduces the inflected and derived forms of a word to a common base. For instance, in a typical stemming scheme, care, caring, careful, and carefully would be mapped to “care”. Verb conjugations are also stemmed; for example, “am”, “is”, and “are” are mapped to the infinitive “be”. Stemming algorithms are often heuristic, and can range from crude truncation of word suffixes to more sophisticated methods. Methods based on truncation may result in over-stemming, where the word is truncated to such a degree that its stem is lost. Under-stemming occurs when similar forms of a word that should share one stem are instead reduced to two or more distinct stems. An example of an advanced stemming technique is Porter’s algorithm, which consists of five phases of sequentially applied word reductions. In each phase, various rules are selected and applied to successively refine the stemming result. There are also rules that check the number of syllables to distinguish between the suffix and the stem of a word. Other stemming techniques include the older Lovins algorithm from 1968 and the more recent Paice-Husk algorithm from 1990. Examples of the results of applying these three stemming techniques on a short sample of text can be found HERE.
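A deliberately crude suffix-truncation stemmer, written here only to make the failure modes above concrete (it is not Porter’s algorithm), shows both over- and under-stemming on the “care” family:

```python
SUFFIXES = ["fully", "ful", "ing", "ly", "es", "s", "e"]  # try longest suffix first

def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters of stem
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: crude_stem(w) for w in ["care", "caring", "careful", "carefully"]}
print(stems)
# "care" and "caring" truncate to "car" (over-stemming: the final "e" is lost),
# while "careful"/"carefully" yield "care" -- two stems for one word family
# (under-stemming). Porter-style rules (e.g. syllable checks) exist to avoid this.
```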
Lemmatization is the process of analyzing words using a vocabulary and the morphological structure of the words. Morphology refers to the study of words and their parts, or morphemes. Morphemes are the most basic and smallest meaningful units of words. They include base words (the base word of “technological” is “technology”), prefixes (“in-” in “inaccurate”, attached to the base “accurate”), and suffixes (“-al” in “prepositional”, attached to the base “preposition”). Lemmatization algorithms are more complex than stemming algorithms. Their goal is to remove inflectional endings to obtain the lemma of a word: its base, canonical, or dictionary form. Whereas the word “saw” may be stemmed as simply “s”, lemmatization would return the token “see” or “saw” after determining whether the token is a noun or verb. Furthermore, in contrast to stemming, which normally returns the stem from derivationally related words (known as “collapsing” the word), lemmatization generally collapses only the different inflectional forms of a lemma. Both stemming and lemmatization algorithms are available as powerful and multi-featured commercial packages, as open-source programs and plug-ins, and as library functions and packages available in a variety of programming languages, including Python and R, the two languages used most widely in humanities domains.
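Because the lemma depends on the part of speech, a lemmatizer can be sketched as a lexicon lookup keyed on (token, POS). The toy lexicon below is purely illustrative; real lemmatizers such as WordNet’s combine a full lexicon with morphological rules.

```python
# Illustrative lexicon: (token, part of speech) -> lemma
LEXICON = {
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("are", "verb"): "be",
    ("better", "adj"): "good",
}

def lemmatize(token, pos):
    # Fall back to the token itself when it is not in the lexicon
    return LEXICON.get((token, pos), token)

print(lemmatize("saw", "verb"))  # → see
print(lemmatize("saw", "noun"))  # → saw
```

This illustrates the key contrast with stemming: the same surface form (“saw”) maps to different lemmas depending on its grammatical role.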
In addition to tokenization, stemming, and lemmatization, sentences must be segmented to obtain information entities mentioned in the document. Paragraphs are segmented to provide context so that relationships between entities can be analyzed. Two important components of these tasks are Part-Of-Speech (POS) tagging, where the part of speech (noun, verb, adjective, etc.) for each entity is determined, and entity tagging, where the designation for each entity, such as person, place, concept, organization, etc., is determined. These processes require sophisticated algorithms, as tagging a word is dependent upon the context in which the word occurs.
In the context of machine learning, feature extraction is the process of determining a relatively small number of variables, and combinations of these variables, that adequately describe and represent the original data for the purpose of further analysis. Feature extraction is necessary to reduce the amount of computational resources needed to analyze large volumes of data. It is related to feature engineering, which uses domain knowledge to extract characteristics and descriptive attributes from data. Feature extraction is a large field in which many sophisticated algorithms are available, including principal component analysis (PCA) and independent component analysis (ICA). The latter technique is employed to solve the problem of source separation, in which data are composed of a mixture of generally independent components that must be separated into their individual sources. A classic example is the cocktail party problem: a listener hears a mixture of sounds from different sources, and the brain must discern the individual sounds, as when guests at a cocktail party hear their names and turn their attention to the source of that sound.
Most relevant to text mining is feature selection, or variable selection, which is considered to be a subset of the more general area of feature extraction (Sumathy & Chidambaram, 2013). Feature selection is the process of determining the most relevant or important features of the many features that are extracted, removing redundant or unimportant features from further consideration. The features ultimately selected are subsequently used to create a model of the data. One important model of this kind is the “bag of words”. The bag of words model is a common feature set in text mining, and is widely used in information retrieval, document classification, and natural language processing (NLP). In the bag of words model, a text or document is represented as a set of words that may contain duplicates (also known as a multiset), but in which grammar is disregarded. The model represents the multiplicity (count) of each word in the bag, but the ordering of the words is unimportant. Consequently, the bag of words model is often used with other approaches, such as n-grams, which represent ordered sequences of words.
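Both representations are simple to express in standard-library Python: a bag of words is a multiset of tokens, and n-grams are sliding windows of n adjacent tokens.

```python
from collections import Counter

def bag_of_words(tokens):
    # Multiset: word order is discarded, but multiplicity (count) is kept
    return Counter(tokens)

def ngrams(tokens, n):
    # Ordered sequences of n adjacent tokens, restoring some word-order information
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(bag_of_words(tokens))  # counts: to=2, be=2, or=1, not=1
print(ngrams(tokens, 2))     # bigrams: (to, be), (be, or), (or, not), (not, to), (to, be)
```

Note that the bag treats both occurrences of the bigram (“to”, “be”) identically, while the n-gram list preserves where in the text each pair occurs.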
After feature selection, further analysis steps are performed on the corpus. A typical operation is to generate the document-term matrix, a large, sparse matrix used for calculating statistics and other measures. Because text is not naturally in the numerical format that is conducive to computational manipulation, text processing and analysis is computationally and memory intensive. Therefore, specialized procedures are often performed to improve efficiency and to reduce the amount of space needed to store the corpus. One such procedure converts numerical vector representations of text from “term” space to “topic” space, wherein documents with similar topics are associated with each other even though they may use different terms. For instance, words such as “sleet”, “hail”, and “snow” may be mapped to the same topic if they frequently co-occur. Singular value decomposition (SVD), a matrix factorization technique with wide applicability in science and engineering, can be used to convert a “term” vector into a “concept” vector (Wasilewska). SVD decomposes a large sparse matrix of size M x N (M rows and N columns) into the product of three smaller, dense matrices of sizes M x K, K x K, and K x N. From linear algebra, the product of two matrices of sizes (M x K) and (K x K) is a matrix of size M x K (the number of rows in the first matrix by the number of columns in the second). This matrix, when multiplied by another matrix of size (K x N), results in a matrix of size (M x N), the size of the original matrix. (In general, if matrix A has size (M x N) and matrix B has size (N x K), then the product C = AB has size (M x K); the number of columns in A must equal the number of rows in B.)
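Constructing the document-term matrix itself is straightforward; the three toy documents below (echoing the “sleet”/“hail”/“snow” example) are illustrative, and the matrix’s many zeros show why such matrices are stored in sparse form in practice.

```python
from collections import Counter

docs = [
    "sleet and hail fell",
    "hail and snow fell",
    "markets fell sharply",
]

# Vocabulary: one column per distinct term across the corpus
vocab = sorted({w for d in docs for w in d.split()})

# Document-term matrix: one row per document, entry = count of that term in the document
dtm = [[Counter(d.split())[term] for term in vocab] for d in docs]

print(vocab)
for row in dtm:
    print(row)  # most entries are 0: the matrix is sparse
```

Factoring this M x N matrix (here 3 x 7) with SVD, as described above, would replace the per-term columns with a smaller number K of “concept” dimensions.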
In addition to SVD, topic modeling is another popular technique for transforming the document into a smaller set of topic dimensions (Wasilewska).
Further data mining analysis includes document clustering, text categorization, text clustering, and, frequently, sentiment analysis.
Clustering is a common machine learning technique. It is an unsupervised method: clustering is performed based only on the characteristics of the samples to be clustered, and not through human guidance, as is the case in supervised machine learning algorithms. Human users are usually involved only in selecting the number of desired clusters, and even here, recent techniques exist to select the optimal number of clusters automatically. In the context of text mining, clustering is used for organizing documents, for facilitating searching and browsing of those documents, and for summarizing corpora. There are two primary approaches: the k-means clustering algorithm and hierarchical methods. In the k-means technique, a user specifies a number of clusters, denoted as k. It is a distance-based technique, wherein a distance measure is calculated between individual data elements. Based on the features of the data being clustered, the algorithm assigns each data element to a cluster so that the distance between data within the same cluster (the intracluster distance) is minimized. The k-means approach is iterative. Initially, data are randomly assigned to clusters (although some variations and enhancements of the technique have preprocessing steps to assign initial clusters more “intelligently”). The intracluster distances are then minimized, and the data are reassigned to clusters based on these distances. The process continues until the algorithm no longer reassigns clusters; at that point, the optimal (in a mathematical sense) intracluster distance has been achieved, and the data have been assigned to clusters based on these optimized measures. In hierarchical clustering analysis (HCA), a hierarchy of clusters is constructed. Agglomerative hierarchical clustering is the process whereby pairs of smaller clusters are merged to gain a higher position in the hierarchy; it is therefore classified as a “bottom-up” approach.
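The k-means iteration just described, assign each point to its nearest centroid, then move each centroid to the mean of its cluster, can be sketched for one-dimensional data in standard-library Python. For reproducibility this sketch initializes centroids from the first k points rather than randomly.

```python
def kmeans(points, k, iters=20):
    # Deterministic initialization for the sketch: first k points serve as centroids
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 1) for c in centroids))  # → [1.0, 10.0]
```

A production loop would also stop early once assignments no longer change, as the text notes; the fixed iteration count here keeps the sketch short.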
In the divisive approach, conversely, clusters are formed by recursively splitting larger clusters that have a lower position in the hierarchy; it is therefore a “top-down” approach. The resulting hierarchy can be visualized in a special plot known as a dendrogram, a tree-like structure wherein the nodes of the graph are arranged in a hierarchical manner and connected by edges denoting the hierarchical relationships. A generic example of a dendrogram is found in Figure 1 of Clustering Corpus Data with Hierarchical Cluster Analysis. An example of HCA in text analysis is clustering 15 text categories in a U.S. English corpus, based on the prepositions in each of the categories, ignoring the lengths of the prepositions. The resulting dendrogram is shown in Figure 2.
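The bottom-up (agglomerative) construction can be sketched for one-dimensional data: start with every point as its own cluster, then repeatedly merge the two closest clusters. This sketch uses single linkage (distance between a cluster pair = distance between their closest members), one of several common linkage choices.

```python
def agglomerative(points, target_clusters):
    # Start with every point as its own cluster ("bottom-up")
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters whose closest members are nearest (single linkage)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two nearest clusters, moving one level up the hierarchy
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

result = agglomerative([1.0, 1.1, 5.0, 5.2, 9.9], 3)
print(result)  # → [[1.0, 1.1], [5.0, 5.2], [9.9]]
```

Recording the order and distance of each merge would yield exactly the information a dendrogram plots.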
Text categorization is the process of categorizing and classifying documents based on their content. Weights are assigned to subjects within the document and are subsequently used to determine the assignment of the document to a particular class. This content-based classification is used in language identification, and in the relatively new field of sentiment analysis, in which subjective measures, such as affective states, are assigned to text (Wasilewska).
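The weighting idea can be sketched as follows: each category carries per-term weights, a document is scored by summing the weights of its tokens, and the highest-scoring category wins. The categories and weights below are hypothetical; a real classifier would learn such weights from labeled training data.

```python
# Hypothetical per-category term weights; a real system would learn these.
CATEGORY_WEIGHTS = {
    "weather": {"snow": 2.0, "hail": 2.0, "forecast": 1.0},
    "finance": {"market": 2.0, "stocks": 2.0, "forecast": 0.5},
}

def categorize(tokens):
    # Score each category by summing the weights of the tokens it recognizes
    scores = {
        category: sum(weights.get(t, 0.0) for t in tokens)
        for category, weights in CATEGORY_WEIGHTS.items()
    }
    return max(scores, key=scores.get), scores

label, scores = categorize("heavy snow and hail in the forecast".split())
print(label)  # → weather
```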
In the interpretation and evaluation stages of text mining, researchers analyze the results obtained from the previous analyses. The accuracy of these results is assessed using the domain knowledge of the researcher, assisted by statistical and computational analysis and visualization. If necessary, the input data for one or more of the processes are refined, parameters of the algorithms are adjusted, and the entire process, or parts thereof, is repeated and subsequently re-analyzed (Wasilewska).