Big Data in the Digital Humanities
Introduction
As is the case with many of the terms addressed in the discussion of the digital humanities (including the definition of “Digital Humanities” itself), Big Data, both as a term and as a concept, does not have a clear genealogy. It is defined in the Oxford English Dictionary as “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges” (quoted in Kaplan, 2015). This dictionary definition emphasizes a specific characteristic of “Big Data”: that it is not conducive to manual analysis, necessitating new computational and interpretation techniques. Another obvious characteristic of Big Data is that it is indeed “big”, not only in terms of its size but in its relationships to other data; that is, Big Data constitutes data that are “fundamentally networked”, and, consequently, processing Big Data must take these networked characteristics into account. A concise definition is given by Cao (2017): Big Data “[r]efers to data that are too large and/or complex to be effectively and/or efficiently handled by traditional data-related theories, technologies, and tools”.
However, it must be emphasized that “small data” are also crucially important to the digital humanities. “Small data” concerns more focused research and analysis that does not call for the advanced techniques needed to process and analyze vast quantities of “Big Data”. Consequently, “small data” is not only “smaller” than “Big Data” but also more focused, with better-defined boundaries.
Finally, in discussions of the technical definition of “Big Data”, the “Three [or four, or five] V’s” inevitably arise. In an early definition, there were three “v’s”: volume, variety, and velocity.
Volume refers to the size of the data, usually extremely large quantities of data that comprise typical “Big” datasets. Although there is no upper or lower bound, it is considered a (very loose) heuristic that “Big Data” must have a size at least on the order of terabytes (1 TB is exactly 2⁴⁰ bytes, or approximately 10¹² bytes).
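To make the heuristic concrete, the arithmetic behind the terabyte figure can be written out; the line below simply expands the powers of two and ten mentioned above.

```latex
% Worked arithmetic for the terabyte heuristic
1\ \text{TB} = 2^{40}\ \text{bytes} = 1{,}099{,}511{,}627{,}776\ \text{bytes} \approx 1.1 \times 10^{12}\ \text{bytes}
```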
Variety refers to the nature and type of the data, usually with emphasis on their heterogeneity. In addition to standard, “structured” data that can be represented in a table (or relational database), unstructured text (such as Twitter “Tweets”), images, audio, video, etc. can be considered data types for “Big Data”. Unstructured data are particularly relevant for Big Data. While structured data conform to “rules” and can easily be characterized (e.g., integers and numbers in general, or geographic locations represented as latitude/longitude or as text names), unstructured data do not admit such easy characterization. Videos, audio recordings, and even “Tweets” must be interpreted by human users. Consequently, one of the challenges for Big Data technology is to process these unstructured data and aid in interpreting them.
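As a minimal illustration of this distinction, the sketch below contrasts a small structured table with a pair of free-text records; the table values and “tweets” are invented placeholders, and pandas is assumed only as a convenient way to show a typed, tabular layout.

```python
# A minimal sketch contrasting structured and unstructured data.
# The table values and "tweets" below are invented placeholders.
import pandas as pd

# Structured data: every field has a well-defined type and meaning,
# so it fits naturally into a relational table.
structured = pd.DataFrame({
    "year": [1851, 1902],
    "latitude": [48.8566, 51.5074],
    "longitude": [2.3522, -0.1278],
})

# Unstructured data: free text whose meaning must be interpreted,
# not simply read off from a schema.
unstructured = [
    "Just visited the archive -- an amazing collection of 19th-c. letters!",
    "Anyone know a good OCR tool for Gothic typefaces?",
]

print(structured.dtypes)     # column types are explicit and machine-checkable
print(len(unstructured[0]))  # only superficial properties are directly available
```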
Velocity refers to the speed at which the data are generated and processed. With data continuously arriving from sensors and social media, “Big Data” are captured at a very high frequency.
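A minimal sketch of this streaming character is given below, with a synthetic Python generator standing in for a real sensor feed; the point is only that readings are summarized one at a time rather than loaded as a single batch.

```python
# A minimal sketch of velocity: readings arrive as a continuous stream and are
# summarized incrementally instead of being loaded all at once.
# The "sensor" is a synthetic stand-in, not a real data source.
import random
import time

def sensor_stream(n_readings):
    """Yield timestamped readings one at a time, as a streaming source would."""
    for _ in range(n_readings):
        yield {"timestamp": time.time(), "value": random.gauss(20.0, 2.0)}

total, count = 0.0, 0
for reading in sensor_stream(1000):
    # Update a running summary without storing the whole stream in memory.
    total += reading["value"]
    count += 1

print(f"mean of {count} streamed readings: {total / count:.2f}")
```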
In addition to the three primary “v’s” of Big Data, sometimes a fourth “v” is added: veracity, which denotes how trustworthy or “certain” data are, and how easily anomalies can be detected and handled. Finally, a fifth “v”, value, is often added. Value denotes the expected insights that can be drawn from an analysis.
One possible way of addressing the challenges of “Big Data” (BD) in the digital humanities is to consider these challenges as comprising three subtasks:
- processing and analyzing “big cultural data”, which is encompassed by;
- digital culture, which is in turn encompassed by;
- digital experiences (see Figure 1 in Kaplan, 2015).
The first subtask, Big Data processing and analysis, concerns the data processing pipeline. It comprises research into the methods and practices of large databases, and into Big Data processing that yields new insights and new types of understanding. This task requires the participation of computer scientists and software developers, as many methods for processing and analyzing big data have not yet been invented. However, data processing can follow a basic linear pipeline. In the first stage, analog (non-digital) data and physical artefacts, such as texts, must be digitized. Sensors and capture systems assist in this process. Additionally, metadata must be attached to the digitized result for documentation purposes. Specific questions include which techniques to apply to these problems so as to produce the results sought.
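A minimal sketch of attaching documentation metadata to a digitized artefact is shown below; the field names and values are illustrative assumptions rather than an established metadata standard (real projects typically adopt schemas such as Dublin Core).

```python
# A minimal sketch of attaching documentation metadata to a digitized artefact.
# The field names and values are illustrative, not an established metadata
# standard (real projects typically adopt schemas such as Dublin Core).
from dataclasses import dataclass
from datetime import date

@dataclass
class DigitizedArtefact:
    identifier: str           # catalogue or accession number (hypothetical)
    source_description: str   # what physical object was digitized
    capture_device: str       # scanner, camera, or sensor used
    capture_date: date        # when the digitization took place
    file_path: str            # where the digital surrogate is stored
    notes: str = ""           # free-text documentation

letter = DigitizedArtefact(
    identifier="MS-1851-017",
    source_description="Handwritten letter, 1851, two pages",
    capture_device="Flatbed scanner, 600 dpi",
    capture_date=date(2024, 3, 12),
    file_path="scans/ms-1851-017.tiff",
    notes="Minor foxing on page 2.",
)
print(letter.identifier, letter.capture_device)
```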
The next stage involves transcription. Digitized texts must be transcribed, whether manually, semi-automatically, or automatically; that is, the texts must be “read”. Features in paintings or other artwork, or in digital photos, must be extracted and recognized. Video and audio content must be segmented (broken down into component parts) and subsequently transcribed. A pertinent concern in this stage is the assessment of errors and biases within the transcription process.
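As one hedged example of automatic transcription, the sketch below runs optical character recognition on a digitized page, assuming the pytesseract and Pillow packages (and an underlying Tesseract installation) are available; the file path is a placeholder.

```python
# A minimal sketch of automatic transcription via OCR, assuming the optional
# pytesseract and Pillow packages (and an underlying Tesseract installation)
# are available. The file path is a placeholder.
from PIL import Image
import pytesseract

page = Image.open("scans/ms-1851-017.tiff")           # digitized page image
text = pytesseract.image_to_string(page, lang="eng")  # automatic transcription

# OCR output is rarely error-free, so some review or confidence check
# is normally attached to this stage.
print(text[:200])
```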
Pattern recognition follows transcription. In this stage, common patterns are detected from collections of artefacts, such as artwork or physical models that have been digitized. For text, names of people and geographic locations are identified. Semantic graphs of data are generated. Specific questions include how to reconstruct and analyze relationship networks and patterns from these data sets.
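A minimal sketch of this kind of pattern recognition is given below: named entities are extracted with spaCy and then linked into a simple co-occurrence graph with networkx. The sentence is an invented example, and the small English model (en_core_web_sm) is assumed to have been downloaded separately.

```python
# A minimal sketch of pattern recognition over transcribed text: named-entity
# recognition with spaCy, then a simple co-occurrence graph with networkx.
# The sentence is invented; the small English model ("en_core_web_sm") is
# assumed to have been downloaded separately.
import itertools

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Charles Dickens travelled from London to Paris in 1855.")

# Keep people and places recognized by the model.
entities = [ent.text for ent in doc.ents if ent.label_ in {"PERSON", "GPE", "LOC"}]

# Link entities that co-occur in the same passage.
graph = nx.Graph()
graph.add_edges_from(itertools.combinations(entities, 2))

print(entities)
print(list(graph.edges()))
```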
Simulation and inference deal with inferring new data on the basis of available, analyzed data sets. This may include simulating missing data based on the patterns that have been detected in the previous stage. Analogously to the case of error and bias assessment in pattern recognition, a relevant problem is assessing the uncertainty associated with this inferred or simulated data.
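The sketch below illustrates one very simple form of such inference: filling gaps in a numeric series with the mean of the observed values and reporting their spread as a rough uncertainty estimate. The series of page counts is an invented placeholder, and real projects would use far richer models.

```python
# A minimal sketch of inference over incomplete data: gaps in a numeric series
# are filled with the mean of the observed values, and their spread is reported
# as a rough uncertainty estimate. The page counts are invented placeholders.
import numpy as np

page_counts = np.array([212.0, 198.0, np.nan, 231.0, np.nan, 205.0])

observed = page_counts[~np.isnan(page_counts)]
estimate = observed.mean()           # value inferred for the missing entries
uncertainty = observed.std(ddof=1)   # spread of observed data as a rough error bar

imputed = np.where(np.isnan(page_counts), estimate, page_counts)
print(f"imputed value: {estimate:.1f} ± {uncertainty:.1f}")
print(imputed)
```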
The last stage in processing is preservation and curation, dealing with questions of storage, data redundancy, and whether storage should be centralized or decentralized. Major concerns in this stage are data privacy, security, and authenticity.
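One small, concrete piece of the authenticity question is fixity checking; the sketch below records a SHA-256 checksum for a stored file and later verifies it, using only Python’s standard library. The file path is a placeholder.

```python
# A minimal sketch of a fixity (authenticity) check during preservation:
# record a SHA-256 checksum for a stored file and later verify that the file
# has not been altered or corrupted. Uses only the standard library; the
# file path is a placeholder.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

archived_copy = Path("scans/ms-1851-017.tiff")
recorded_checksum = sha256_of(archived_copy)   # stored alongside the artefact

# Later, during a curation audit:
if sha256_of(archived_copy) == recorded_checksum:
    print("fixity check passed: file is unchanged")
else:
    print("fixity check failed: file may be corrupted or tampered with")
```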
The second major task for Big Data in the digital humanities concerns digital culture. This category of tasks and problems involves how a digital culture could be structured around a network of relationships between the new digital entities generated in the first category of tasks, entities that are possible only through the technological and computational advances described above. Questions addressed by this task include: the effect of new technology on redefining scholarly discourses; the effect of “open peer review” on scholarly publishing; the embedding of new media, such as videos, visualizations, and simulations, into scholarly publishing; and the status of interactive visualizations, which straddle the domains of visual analytics and human-computer interaction. Another important consideration is bias in search engines, auto-completion, machine translation, and other text transformations. Finally, issues such as the “ownership” of data and software, and the role of “global IT actors” (e.g., search engine developers and social media companies, as well as educational institutions), must be addressed.
Whereas the first category of tasks focuses on the relationship between software and data, the second focuses on digital culture. Here, the relationships are between the new digital entities themselves: digital communities, such as asynchronous classrooms and Wikipedia entries; collective discourses, such as blog posts and collaborative writing for websites; ubiquitous software (e.g., search engines); and large, global IT actors.
The third set of subtasks, subsuming the previous two, relates to digital experiences and the realization of big cultural datasets in the “real”, physical world. Alternatively, it addresses “how challenges dealing with the experience of digital data can be described using the continuous space of possible interfaces” (Kaplan, 2015). The challenges at this stage arise at the immersive, abstractive, and linguistic levels. For instance, the degree of immersion into a data set depends on whether a “3D world”, virtual reality, or augmented reality is used in museums, schools, or other forums, or whether a 2D network graph visualization that clearly indicates relationships may be more effective in a given context. At the level of abstraction, issues involving the degree of density or sparsity of representations of large data sets, as well as challenges with navigating abstractions of these data sets, must be considered. Related to the synergistic relationship between visual analytics and the digital humanities, an important factor is the degree to which visualizations assist humans’ cognitive abilities to detect new patterns or to spot anomalies. Finally, at the linguistic level, challenges include the sorting, organization, and subsequent visualization of large quantities of text. Related to this challenge is whether, and in what way, distant and close reading can be combined.
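As a minimal sketch of the linguistic level, the example below reduces a tiny, invented corpus to sortable word frequencies, the kind of coarse “distant” view that can then be visualized or used to steer a return to close reading of particular passages.

```python
# A minimal sketch of a distant-reading step: a (tiny, invented) corpus is
# reduced to sortable word frequencies, a coarse view that can feed a
# visualization or steer a return to close reading of specific passages.
import re
from collections import Counter

corpus = [
    "The archive holds letters, diaries, and ledgers from the 1850s.",
    "Most letters discuss trade; a few diaries mention the railway.",
]

counts = Counter()
for document in corpus:
    counts.update(re.findall(r"[a-z]+", document.lower()))

# The most frequent words give a coarse, "distant" view of the collection.
print(counts.most_common(5))
```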
Although alternative maps or images for describing the challenges of Big Data in the digital humanities exist, the three categories of challenges described above serve as a useful tool for understanding not only the technical problems in Big Data processing (which undoubtedly exist), but also the relationships between technical and non-technical concerns (Kaplan, 2015).