Is Coding Necessary for a Digital Humanist?

As much as rich, interactive interfaces aid humanities scholarship, there is also a compelling counterargument against them.  Making tools more intuitive and easier to use increases the complexity of the overall system.  Furthermore, it adds layers of abstraction that distance users from the underlying analysis algorithms, thereby complicating interpretation.  Put simply, oversimplification often leads to obfuscation.


Scholarly work in the digital humanities is generally not the site of “big software” development; code development in this field is by necessity agile and experimental.  The digital humanities are part of the “computational turn”, as evidenced by the proliferation of “computational” disciplines such as computational biology and computational linguistics, and computational tools are now indispensable.  However, rich, visual, interactive interfaces that act as wrappers around complicated mathematical algorithms also hide the complexities and subtleties of those algorithms’ inner workings, thereby lessening the likelihood that users will engage with the methods themselves.  Dennis Tenen offers the example of astronomers looking through a telescope at a fascinating, previously unobserved constellation of stars.  Seen through a critical lens, however, the question is whether the astronomers are really making an exciting discovery, or whether they are witnessing the side effects of a defective telescope (Tenen, 2016).


The point is that tools are meant to be mastered, not simply used.  Without such mastery, it becomes difficult or impossible to assess the validity of the result of a computation.  Tenen provides a further computational example from the Python NLTK (Natural Language Toolkit) library, a popular and widely used suite of algorithm implementations and functions for natural language processing (NLP).  NLTK features clustering functions that determine groups of similar documents in a collection, or corpus.  For instance, documents may be grouped by the number of personal pronouns they contain, or by their average sentence length.  One of the clustering algorithms in NLTK is k-means clustering.  The algorithm works by iteratively assigning items (here, documents represented as feature vectors) to the cluster whose mean, or centroid, is nearest, and then recomputing those means, so that the distance between each item and the mean of its own cluster is minimized.  The “k” in k-means is the number of clusters, which is selected by the user.  The k-means algorithm is unsupervised, meaning that it determines the clusters without guidance from the user, who only needs to specify k.  K-means clustering has been used successfully for many years in a multitude of applications, and it is unquestionably one of the standard techniques of unsupervised machine learning.  Its success, however, comes at a price.  One difficulty is selecting the number of clusters, i.e., choosing an appropriate value for k, and the appropriate choice depends largely on the question being asked.  For example, in clustering nineteenth-century novels, k might be chosen to be on the order of a dozen, representing the approximate number of genres under analysis.  The quality of the final clustering result, however, depends on k.
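
As a concrete illustration, a minimal sketch of how such a clustering might be invoked with NLTK is given below.  The feature vectors and the choice of k = 2 are purely hypothetical (say, a personal-pronoun count and an average sentence length for each document); the clustering itself uses NLTK’s KMeansClusterer and euclidean_distance from its nltk.cluster module.

    import numpy as np
    from nltk.cluster import KMeansClusterer, euclidean_distance

    # Hypothetical feature vectors, one per document:
    # [personal-pronoun count, average sentence length]
    documents = [
        np.array([34.0, 18.2]),
        np.array([12.0, 25.7]),
        np.array([31.0, 17.9]),
        np.array([10.0, 27.1]),
    ]

    # k, the number of clusters, must be chosen by the user; here k = 2.
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)

    # assign_clusters=True returns the cluster index assigned to each document.
    assignments = clusterer.cluster(documents, assign_clusters=True)
    print(assignments)        # e.g. [0, 1, 0, 1]
    print(clusterer.means())  # the mean (centroid) of each cluster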


An even more difficult problem is determining the meaning of the resultant clusters, and what insights can be gained from them.  There is no guarantee that the clusters correspond to sets of features that are meaningful to the analyst.  This is to be expected, as the k-means algorithm is unsupervised.  The technique will always return a result, but the meaning of that result may not be immediately evident.  Or perhaps there is no “meaning” relevant to the questions the user is asking.  The clusters are not ordered by value, significance, or any other measurable quantity.
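
Continuing the sketch above, about the only handle the analyst has on “meaning” is to inspect the cluster means themselves; the labels 0 and 1 are arbitrary identifiers, not ranks, and may even be swapped on another run.

    # Cluster indices are arbitrary labels, not scores or ranks.
    # Interpretation has to come from examining the centroids directly.
    for i, centroid in enumerate(clusterer.means()):
        print(f"cluster {i}: mean pronoun count {centroid[0]:.1f}, "
              f"mean sentence length {centroid[1]:.1f}")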


A further difficulty is that the k-means algorithm requires initialization, and this step is normally performed randomly.  While the algorithm will eventually converge, or stabilize, to a solution, the results may differ with each independent execution because of the random initialization.  Consequently, although there will always be some clustering according to some set of commonalities – that is, there will always be a result – those clusters require interpretation, and, in some cases, may not have an interpretation that is meaningful to the user.  Finally, k-means clustering, although among the most common unsupervised clustering methods, is only one of many; the NLTK library supports several clustering techniques.  The user must therefore have a level of familiarity with these methods to use them and to interpret their results properly.  Often, this requires not only familiarity but a deep understanding of the underlying algorithms.
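
This sensitivity can be seen by simply running the clusterer more than once on the same vectors: two independent runs, started from different random initializations, may converge to different partitions.  The snippet below continues the earlier sketch and assumes NLTK’s rng keyword, which accepts a standard random.Random instance; NLTK’s repeats argument mitigates the problem by keeping the best of several random starts.

    import random

    # Two independent runs, each from a different random initialization.
    for seed in (1, 2):
        c = KMeansClusterer(2, euclidean_distance, rng=random.Random(seed))
        print(seed, c.cluster(documents, assign_clusters=True))

    # The two printed assignments need not agree, even though both runs
    # have converged to a (locally optimal) solution.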


NLTK is essentially a wrapper (more user-friendly Python code around more complicated underlying code) for various machine learning algorithms applied to natural language processing.  It is distributed with a full set of documentation.  As is the case with most Python libraries, the code is open source, meaning that it can be examined and studied.  It is also helpful that NLTK has attracted a large number of contributors and developers over the years, and that its expertise base is quite substantial.  This mode of distribution therefore helps users with a strong mathematical background to use the library effectively.  However, a feature-rich interface or web-based application would add levels of complexity that move the user further away from the underlying algorithms and statistical methods – or NLP logic – they employ.  The NLP logic, the lowest abstraction level, is “wrapped” in Python code to form NLTK.  Adding a graphical user interface adds another layer of abstraction.  The “user friendliness” of the interface simplifies the usage of the tool, but it also adds complexity and obscures the workings of the functions, greatly complicating interpretation of results.  Tenen makes the point that even the relatively low-level Python code makes some implicit assumptions.  If users were not only Python programmers but also had a deep understanding of the statistical techniques behind the machine learning algorithms, then, even if future releases were written in Haskell (a purely functional programming language) instead of Python, their knowledge of the methodology would be transferable from one language to the other (Tenen, 2016).
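
To make this layering concrete, the sketch below shows roughly what a single k-means update looks like beneath the wrapper: a few lines of NumPy that re-assign items to their nearest mean and then recompute the means.  It is an illustrative reimplementation under simplifying assumptions, not NLTK’s actual source code.

    import numpy as np

    def kmeans_step(X, means):
        """One assignment-and-update iteration of k-means.

        X:     (n_items, n_features) array of feature vectors
        means: (k, n_features) array of current cluster means
        """
        # Assign each item to its nearest mean (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each mean from the items assigned to it,
        # keeping the old mean if a cluster happens to be empty.
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(len(means))
        ])
        return labels, new_means

Iterating this step until the assignments stop changing is, in essence, the computation that a single call such as clusterer.cluster(...) wraps; every additional interface layer moves the user one step further from it.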
