Text Mining Workflow

Text mining is an important task in many humanities domains. Most obviously, it is used in text analysis, where corpora of prose, poetry, or other written material are investigated. It also plays a major role in other areas of humanities scholarship, including history and classical studies. Outside the humanities, text mining is increasingly important in biomedical applications, including automated analysis of electronic medical records (EMRs), which often contain unstructured text, images, brief annotations and notes, and even handwritten text.

Text mining, sometimes called text data mining, is closely related to text analytics. Its goal is to extract information from unstructured text. This information is subsequently used in natural language processing (NLP) and/or subjected to further computational analysis. Because text lacks the regular form of numerical data, which are more conducive to computational processing, text mining functions are more complex than operations on numbers. Text mining has several components and involves multiple operations, including:

  • information extraction,
  • information retrieval,
  • analysis of word frequency distribution,
  • pattern recognition to facilitate determining and utilizing context,
  • predictive analytics, and, increasingly,
  • visualization.
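
As an illustration of one operation from the list above, analysis of word frequency distribution can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function name `word_frequencies`, the sample sentence, and the crude regular-expression tokenizer are all assumptions made for the example. A real workflow would normally use a more careful tokenizer and might remove very common words first.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)

# Hypothetical sample text, used only for illustration.
sample = "The quality of mercy is not strained. It droppeth as the gentle rain from heaven."
freq = word_frequencies(sample)

# List the three most frequent words with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

Lowercasing before counting treats "The" and "the" as the same word, which is usually what a frequency analysis wants but loses information about proper nouns and sentence starts.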

Information extraction relies on NLP, which provides extraction algorithms with linguistic data. NLP is used to annotate documents: it can perform part-of-speech tagging (described below), determine sentence boundaries, and parse text.
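
Two of the annotation steps mentioned above, determining sentence boundaries and breaking text into tokens, can be approximated with regular expressions. The sketch below is deliberately naive, and the names `split_sentences` and `tokenize` are hypothetical. It assumes that a sentence ends with ".", "!", or "?" followed by whitespace and a capital letter, so it will split incorrectly after abbreviations such as "Dr." Dedicated NLP libraries handle such cases with trained models. Part-of-speech tagging is not shown because it also requires a trained model.

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection.

    Splits on whitespace that follows '.', '!' or '?' and precedes a capital letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

# Hypothetical sample text, used only for illustration.
text = "Text mining extracts information. Does it need NLP? Often, yes!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, sentence_tokens)
```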

Information retrieval is a set of techniques for representing, storing, and accessing items in textual sources, such as books, newspapers, reports, and other types of documents. This textual information is formatted and stored in database management systems, which process queries from users and return the results. Such queries narrow and refine the set of documents relevant to a particular question. Search engines are prototypical examples of information retrieval systems (Wasilewska).
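
The idea of storing documents so that queries can narrow them down can be illustrated with an inverted index, a common structure in information retrieval that maps each term to the documents containing it. The following is a minimal sketch under stated assumptions: the function names `build_index` and `search`, the three sample documents, and the choice of requiring every query term to match (boolean AND) are all illustrative. It does not describe how any particular database or search engine is implemented.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term in the query (boolean AND)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical miniature collection, used only for illustration.
documents = {
    "letter": "A letter describing the harvest and the market.",
    "report": "An annual report on the grain market.",
    "poem": "A poem about the harvest moon.",
}
index = build_index(documents)
hits = search(index, "harvest market")
print(hits)
```

Each additional query term can only shrink the result set, which mirrors how a query reduces the number of documents of interest.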

Information visualization is used to gain insights into the results of text analysis and can be employed at each stage of the text mining workflow. Visualization is a key component of many areas of humanities scholarship, including distant reading and, increasingly, detailed analyses in close reading (Jänicke et al., 2015).

License

Contemporary Digital Humanities Copyright © 2022 by Mark P. Wachowiak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.
