Introduction to Data Science in the Digital Humanities
Introduction – What Is Data Science?
Data science has been defined, trivially, as “the science of data” (Cao, 2017). A somewhat broader and more detailed description is given by data scientist and historian Benjamin Peck: “…data science is generally agreed to be the practice of applying the scientific method to extract insights from data in order to generate predictions, drive actions, and guide further inquiry” (Peck, 2020).
However, these definitions can be unpacked from a variety of perspectives. Although the high-level definition given above is “data science is the science of data” or “data science is the study of data”, another definition can be formulated that emphasizes the interdisciplinary nature of the field: “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments”. This definition takes into account many (but not all) of the disciplines that contribute to data science. One can proceed further still: under this multi- and interdisciplinary approach, the aim is to “transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology” (Cao, 2017).
Yet another definition can be given in terms of “data products”, which are deliverables or results that are obtained from data or enabled by data. These deliverables take the form of insights, discoveries, and predictions that can be obtained from data and its analysis, and that are subsequently transformed into decision-making. Consequently, the final “data products”, or deliverables, are “knowledge, intelligence, wisdom, and decision” (Cao, 2017).
Several specifically data-related fields are ancillary to data science. Cao delineates these fields and provides clear, concise, and thorough descriptions of them. For the purposes of the digital humanities, the following areas seem to be the most important (see Cao, 2017, especially Table 1).
Data analysis refers to the processing of data by widely accepted statistical, mathematical, logical, or computational methods, tools, and technologies. The goal of data analysis is to transform data into useful, practical information.
Data analytics is distinguished from data analysis in that the former addresses computational methods, tools, theoretical constructs, and technologies that provide an “in-depth understanding and discovery of actionable insight into data” (Cao, 2017). Data analytics comprises three subfields, which are described below.
The first subfield of data analytics is descriptive analytics, in which useful information is obtained from data through traditional statistical methods. For instance, one can gain valuable information about a phenomenon through the mean, standard deviation, minimum value, and maximum value of a set of data.
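As a minimal sketch of this idea, the following Python snippet computes these four summary statistics for a small set of measurements (the values themselves are invented for the example):

```python
# Descriptive analytics in miniature: summarizing a data set with
# traditional statistics (the values below are hypothetical).
import statistics

data = [12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 12.9]

print("mean:", statistics.mean(data))
print("standard deviation:", statistics.stdev(data))
print("minimum:", min(data))
print("maximum:", max(data))
```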
The second subfield is predictive analytics, in which predictions are made concerning unknown future events. The reasons for these predicted events are generally in the domain of advanced analytics, defined below.
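As an illustrative sketch, with invented yearly counts, one simple form of predictive analytics is fitting a trend to past observations and extrapolating it to an unseen point:

```python
# Predictive analytics in miniature: fitting a linear trend to
# hypothetical yearly counts and extrapolating one step ahead.
import numpy as np

years = np.array([0, 1, 2, 3, 4])             # years since 2016 (hypothetical)
counts = np.array([110, 125, 138, 150, 166])  # hypothetical yearly counts

slope, intercept = np.polyfit(years, counts, 1)  # least-squares line
prediction = slope * 5 + intercept               # predict year 5 (2021)

print(f"predicted count for 2021: {prediction:.1f}")
```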
The third subfield is prescriptive analytics, which is the type of data analytics that optimizes the decision-making process and facilitates recommendations and actions for “smart decision-making”.
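Prescriptive analytics typically involves optimization under constraints; the toy sketch below, with hypothetical actions and benefit scores, reduces the idea to its simplest form, recommending the action with the highest estimated payoff:

```python
# Prescriptive analytics in miniature: recommending the action with the
# highest estimated benefit. Actions and scores are hypothetical.
actions = {
    "digitize collection A": 0.72,
    "digitize collection B": 0.65,
    "digitize collection C": 0.81,
}

recommendation = max(actions, key=actions.get)  # action with highest score
print("recommended action:", recommendation)
```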
Another important aspect of data science is advanced analytics, which deals with the theoretical underpinnings, tools, and technologies that “enable an in-depth understanding and discovery of actionable insights in big data, which cannot be achieved by traditional data analysis and processing theories, technologies, tools, and processes” (Cao, 2017).
Although the preceding paragraphs provide clarity, they should not be taken to suggest that data science is entirely based on technologies. Data science, of course, was practiced long before the term came into vogue. Furthermore, as Peck emphasizes, data science should not be conflated with artificial intelligence, machine learning, or, in particular, “Big Data”, although all of these concepts represent the outward-facing side of data science.
For the digital humanist, arguably the most important aspect of data science is that it augments and enhances the interpretive functions of a discipline (Peck, 2020).
In a similar vein, digital humanities scholar Lev Manovich of the City University of New York notes in a recent article that data science is a new and increasingly popular discipline that combines classical statistics with newer techniques for processing and analyzing “big data”. This importance is recognized in many professional fields, which are beginning to employ data scientists for predictive purposes and, importantly, to extract value from vast amounts of data (Manovich, 2019).
To process these data, data science researchers and practitioners develop and apply algorithms that automatically extract various statistics from large data objects. These statistics, known as features, serve as concise, summarized descriptions of the objects that can be easily understood and interpreted. Feature extraction is the process of computing and obtaining these features.
For data consisting of strictly numerical values, descriptive statistics are typical features. One such feature is the arithmetic mean, or average, which is a measure of the “centre” of the data. Another widely used feature is the standard deviation, which quantifies the variability of the data about the mean. The mean of multidimensional data can also be calculated. For example, for 3D data, indicating, perhaps, the position of an object in some reference space, the mean, or in this case the centroid, of the data is a single 3D point consisting of the mean along each of the dimensions.

For more complex objects, defining features and calculating them becomes more challenging. In optical character recognition, for example, features of an individual letter may include lines, open and closed loops, or intersections, which are extracted through image processing techniques. Features extracted from greyscale or colour images include edges, brightness, contrast, and noise characteristics, as well as individual objects in the image, which are obtained through a process known as segmentation that partitions an image into its separate components or regions. 3D architectural models have sets of features describing colour, shape, surface area, and volume attributes, as well as geometric features. In general, to properly characterize these complex data objects, more features must be extracted.
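The numerical case is easy to make concrete. The sketch below, using a small set of invented 3D points, extracts two simple features of the kind described above, the centroid and the per-axis spread:

```python
# Feature extraction in miniature: reducing a set of hypothetical 3D
# points to two simple features, the centroid and the per-axis spread.
import numpy as np

points = np.array([          # hypothetical 3D positions
    [1.0, 2.0, 0.5],
    [1.2, 1.8, 0.7],
    [0.9, 2.1, 0.4],
    [1.1, 1.9, 0.6],
])

centroid = points.mean(axis=0)  # mean along each dimension
spread = points.std(axis=0)     # variability about the centroid

print("centroid:", centroid)
print("per-axis standard deviation:", spread)
```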
More specific questions pertaining to data subsequently arise. The principal goal, according to Manovich, is to represent phenomena as data because data are “computable”; that is, a computer can read, transform, and analyze them. To do this, the boundaries of the phenomenon must first be determined, and it must be decided what information is to be included in the analysis, so that the task becomes manageable. Manovich provides the example of “modern art” as the phenomenon to be investigated, in which time periods, geographic regions, and particular artists may define the boundaries (Manovich, 2019).
The question then arises as to the data objects that are to be studied. What these objects are is, of course, dependent upon the context and the area being studied. If works of art are to be represented, then data objects may include artists, correspondences and connections between artists, names of artworks, and non-numerical data such as reviews, passages in art books, and social media correspondence. For the medical aspects of a hospital, data objects include people, such as doctors, nurses, administrators, and patients, laboratory test results, medical forms and medical records, medical procedures that are performed, and medical image data. These objects are variously known as data points, records, samples, measurements, as well as independent and dependent variables in the case of statistical analysis.
Data objects also have various characteristics, including properties, attributes, features (described above), and, very importantly for the digital humanities, metadata, which are “data about the data”: data describing or clarifying the data objects to which they correspond.
Consequently, according to Manovich, the three principal considerations in representing phenomena as data are (1) determining which objects are chosen; (2) selecting the features that characterize the data; and (3) encoding the selected features. These considerations make data representations “computable, manageable, knowable and shareable”, and thus conducive to analysis and interpretation through the techniques of data science (Manovich, 2019).
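These three considerations, together with metadata, can be sketched in a few lines of Python. All names and values below are invented for illustration: a painting is chosen as the object, a handful of features is selected, and object, features, and metadata are encoded in a shareable format (here, JSON):

```python
# A minimal sketch of the three considerations, with hypothetical values:
# (1) an object is chosen, (2) features are selected, (3) both are
# encoded, along with metadata, in a shareable format.
import json

artwork = {
    "object": "painting",               # (1) the chosen object
    "features": {                       # (2) selected features
        "mean_brightness": 0.62,
        "dominant_colour": "blue",
        "width_cm": 73.7,
        "height_cm": 92.1,
    },
    "metadata": {                       # data about the data
        "artist": "unknown",
        "year": 1889,
        "source_collection": "hypothetical archive",
    },
}

print(json.dumps(artwork, indent=2))    # (3) the encoding
```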