Computing And Programming
Working with computers is probably the single most important defining characteristic of digital humanities work. Data collection, digitization, cleaning, preprocessing, processing, analysis, visualization, and exploration are all performed on computers through software systems composed of the digital tools needed by the user. Digital humanists use digital tools as required for a particular application or to solve specific problems. They also typically incorporate these tools into a workflow. A workflow is a systematic and repeatable procedure, or set of procedures, carried out to solve a particular problem or to perform a particular task. Workflows are organized sequences of steps that start from one or more inputs, progress through various preprocessing and processing stages, and arrive at a result or set of results (the output or outputs). Workflows may also include postprocessing and analysis of these results. For example, in text analysis, a typical workflow consists of the following steps.
- Obtain the digital or digitized data (text) from a source, including digitized documents, web sites, databases, social media, etc.
- Extract the textual data to be processed and move it into a corpus (plural corpora), or collection of documents.
- Perform various preprocessing transformations of the text, such as normalization (e.g., converting characters to lower case), removing punctuation, numbers, and other extraneous symbols or symbol combinations, and removing stop words, including those that are domain specific.
- Perform stemming to reduce words to their analyzable stems.
- Extract features from the text using any variety of methods. Typically, this step includes generating the document-term matrix (DTM).
- Perform analysis on the extracted features, such as calculating word frequency, identifying collocations (words or terms that frequently occur together or in close proximity), or locating particular words or phrases of interest in a text through dictionary tagging.
- Advanced analysis can be performed with statistical and/or machine learning methods and/or other computational techniques. Such analyses include topic modeling and document classification.
- Perform any other postprocessing that may be necessary for a particular application.
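The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended implementation: the two-document corpus, the stop-word list, and the helper names `preprocess` and `naive_stem` are all hypothetical, and the suffix-stripping function is only a crude stand-in for a real stemming algorithm.

```python
import re
from collections import Counter

# Tiny illustrative corpus (hypothetical documents invented for this sketch).
corpus = [
    "The whale swam. The whales were swimming in 1851!",
    "A ship sailed; the ships were sailing after the whale.",
]

# A very small stop-word list, chosen for this example only.
STOP_WORDS = {"a", "the", "were", "in", "after"}


def preprocess(text):
    """Lower-case the text, keep alphabetic tokens only, drop stop words."""
    # The pattern [a-z]+ discards punctuation and numbers in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Crude suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Preprocess and stem every document in the corpus.
docs = [[naive_stem(t) for t in preprocess(text)] for text in corpus]

# Document-term matrix: one row per document, one column per vocabulary term.
vocabulary = sorted({t for doc in docs for t in doc})
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

# Simple analyses: overall word frequency and adjacent-pair (bigram) counts,
# the latter being a very rough proxy for collocation.
word_freq = Counter(t for doc in docs for t in doc)
bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

print("Vocabulary:", vocabulary)
for row in dtm:
    print(row)
print("Most frequent terms:", word_freq.most_common(3))
print("Most frequent bigrams:", bigram_counts.most_common(2))
```

In practice, the hand-written pieces would usually be replaced by library components, for example a Porter stemmer from NLTK or a vectorizer from scikit-learn, but the overall shape of the workflow stays the same.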
To perform the tasks in a workflow, digital humanists often use specialized, pre-programmed (executable, or binary) software packages that can be run directly on the user’s computer. Many such tools are freely available through Internet repositories. High-performance commercial software is also available. However, simply using these tools, even as a knowledgeable “power user”, has limitations, as only the functionality provided by the tool can be used. Some tools enforce an established workflow that may not be sufficiently flexible. Users, even expert ones, cannot add new features unless the tool explicitly provides a scripting language or some other extension mechanism. In addition, such tools evolve, so the software must be updated regularly and possibly reinstalled. Some packages may even stop being supported and quickly become obsolete. For high-performance commercial tools, the monetary cost of updating, maintenance, and support may become significant. As one way to address these problems, some humanities scholars recommend open-source software packages, for which the source code used to generate the executable program is available.
In open-source software systems, the user is provided with the source code (in C, C++, Java, Python, etc.) and all adjunct files and libraries needed to compile the system, that is, to create the executable binary program or programs that run on the computer. Very often, pre-compiled binary executable files are also included in the open-source distribution. However, if the system is written in a compiled language such as C, C++, or Java, creating the executables from the source code requires the user to have access to compilers, additional libraries, and perhaps other components. Compiling can sometimes be a difficult, time-consuming, and error-prone task.
Increasingly, digital humanities researchers emphasize the need for programming knowledge at a relatively advanced level (Tenen, 2016; Goldstone, 2019). Through interpreted scripting languages such as Python and R, to name two languages especially popular in humanities work, programming need no longer be an activity carried out only by computational professionals. Python and R (and other programming languages as well) have a vast number of toolkits, or additions, known as libraries in Python and packages in R, that facilitate most kinds of humanities work, such as text analysis, topic modeling, image processing, and statistics. A knowledge of programming provides flexibility in designing, implementing, and adapting workflows, as program components can be incorporated into workflows as required by a particular research question. Scholars need not be constrained by the limitations of pre-programmed software packages (Goldstone, 2019). A large amount of high-quality introductory and tutorial material is freely available through Internet sources. An expansive literature of books and articles can be used to learn programming and programming languages. Programming can also be studied through online courses, academic classes, and short courses offered by colleges, universities, and other educational organizations. Importantly, popular and widely used languages such as Python and R have very large user bases with sophisticated expertise. Consequently, problems encountered during programming can often be resolved through simple Web-based searches. An in-depth knowledge of a widely used programming language empowers scholars to develop their own tools and makes them better able to assess and interpret the results from these tools.