Big Data And “Smart Data” In the Digital Humanities
INTRODUCTION
Data (plural of the singular datum) are fundamental to the digital humanities, and the relationship between humanities disciplines and data science is becoming more evident as advanced data analysis and visualization techniques grow increasingly important for analyzing the large amounts of data studied in the humanities. A challenging aspect of humanities data is its heterogeneity, and the fact that it is situated in a large number of small data sets (such as documents), rather than in a small number of large data sets, as is typical of “Big Data” applications in science, engineering, medicine, and finance. Additionally, much of the data from these latter disciplines is self-descriptive, meaning that the data do not need additional interpretation. For instance, the number “2” is straightforwardly interpreted as two. Numeric and character data have a natural representation in binary code and are therefore naturally conducive to algorithmic manipulation; their semantics are inherent in the data themselves. Humanities data, however, are mostly textual, and semantics must be added subsequently.
The importance of data is underscored by its addition as the “Fourth Pillar” of scientific endeavor, and of scholarship in general. Traditionally, theory and experimentation have been, and are, paradigmatic for scholarly activities. In the latter part of the 20th century, computation was added as the “Third Pillar”, due to its decisive importance in processing, modeling, simulation, and analysis. More recently, with the explosive interest in data, particularly Big Data, and with the advent and widespread acceptance of data-driven research and the recognition that standard data analysis techniques and database management approaches are insufficient to address these new data-related challenges, data itself has been added as the “Fourth Pillar”.
Data are actual representations of objects of interest, such as a person’s name or location. Metadata is information that describes data stored in a particular location or in a particular way. In other words, metadata is “data about the data”. For instance, metadata may record the date and time the data were created, what the data contain (e.g., names or addresses), and general or specific information about the referent of the data. In the humanities, data are “a digital, selectively constructed, machine-actionable abstraction representing some aspects of a given object of humanistic inquiry.” (Schöch, 2013) (4).
WORKING WITH DATA
To understand and work with data, it is necessary to describe how data are arranged, particularly when these data need to be processed computationally. There are various linear arrangements of data, including arrays and arrays in higher dimensions, such as matrices (2D arrays). 2D arrays are used in many applications. An image – specifically, a grey-scale image – is a canonical example of a 2D array. Images themselves are two-dimensional and, when digitized, consist in an orderly row-column arrangement of small, individual picture elements, known as pixels. Each element of the 2D array or matrix, identified by its row and column position, contains a value indicating the brightness of the pixel. In grey-scale images, this value may be a floating-point number between 0 (black) and 1 (white), or, in 8-bit representations, range from 0 (black) to 255 (white). 3D arrays can be used to represent colour images. In addition to the rows (the first dimension) and columns (the second dimension), a third dimension of size three holds the colour components. In the RGB model, colours are represented by combinations of red, green, and blue values. In an 8-bit model, each colour component can assume 2^8 = 256 values, ranging from 0 to 255. Since there are three components (red, green, and blue), (2^8)^3 = 2^24 = 16,777,216 colours can be represented. Many other data in science and engineering – and, increasingly, the humanities – can be represented as 3D data. Frequently, a fourth dimension, time, is added to represent dynamically changing 3D phenomena. Many complex applications employ higher dimensional data, frequently dedicating a dimension to time.
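The row-column arrangement described above can be sketched in Python using nested lists (in practice, NumPy arrays would be the idiomatic choice; plain lists are used here to keep the sketch self-contained, and the pixel values are invented for illustration).

```python
# A 3x3 grey-scale image as a 2D array: each entry is an 8-bit
# brightness value from 0 (black) to 255 (white).
grey_image = [
    [0,   128, 255],
    [64,  192, 32],
    [255, 0,   128],
]

# Pixel at row 1, column 2 (zero-indexed).
print(grey_image[1][2])  # 32

# A colour image adds a third dimension of size 3: the red, green,
# and blue components of each pixel in the RGB model.
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],      # row 0: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # row 1: a blue pixel, a white pixel
]

# Each 8-bit channel takes 2**8 = 256 values, so three channels give
# (2**8)**3 = 2**24 = 16,777,216 representable colours.
print((2**8)**3)  # 16777216
```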
Hierarchical arrangements are also used to represent data. Trees comprise a typical example of a hierarchical data structure. Trees consist of nodes, with child nodes hierarchically related to parent nodes, which have a higher position in the hierarchy. Hierarchical data structures can become complex, with child nodes having one or more child nodes of their own, thereby playing the role of both child and parent. XML, the Extensible Markup Language, which is used for tagging, is modeled in this way, as is TEI (Text Encoding Initiative) data, which is crucially important in literary and textual scholarship.
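The parent-child hierarchy can be seen in a minimal TEI-like fragment parsed with Python’s standard-library XML parser (the element names follow common TEI conventions, but the fragment itself is an invented illustration, not a valid TEI document).

```python
import xml.etree.ElementTree as ET

# A tiny TEI-like fragment: <text> is the parent of <body>,
# which is the parent of two <p> (paragraph) child nodes.
fragment = """
<text>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</text>
"""

root = ET.fromstring(fragment)

# Walk the hierarchy: each node yields its child nodes in order.
for body in root:
    for p in body:
        print(p.tag, "->", p.text)
# p -> First paragraph.
# p -> Second paragraph.
```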
Data are also used to represent relationships. Graph data structures, for example, are used for modeling and visualizing social networks.
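A social network of this kind can be modelled minimally as an adjacency structure, here a Python dictionary mapping each person to the set of people they are connected to (the names and connections are invented).

```python
# An undirected social network as an adjacency dictionary.
network = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice", "Dan"},
    "Dan": {"Carol"},
}

# Degree (number of connections) of each node: a basic
# measure used in social network analysis.
degrees = {person: len(friends) for person, friends in network.items()}
print(degrees["Alice"])  # 2
```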
Data may be structured or unstructured. Structured data is data that can be stored in tables and databases, or in other orderly arrangements. Unstructured data, however, requires special storage, processing, and analysis methods. Although long strings of character data are used to represent text, and these strings can be stored in a database as a single field, the text itself, specifically its content, is not conducive to a tabular or ordered data structure. Consequently, plain text is a prime example of unstructured data.
Data may also be semi-structured. In semi-structured data, such as represented by XML, unstructured text can be organized according to specific schema, enforced by tags, annotations, and markup.
Although it poses greater processing and analysis challenges, unstructured data is the key component of a large amount of humanities scholarship. Texts are generally the raw materials of such scholarship, and text has been the traditional object of study for “big” and “smart” data, although other multimedia, such as images, video, and audio, are increasingly used in these research endeavors, especially in the growing number of applications to non-textual media such as film and media studies (Zeng, 2017). Consequently, there is a need to develop methods to facilitate this processing and analysis.
BIG DATA AND SMART DATA
Two terms are frequently employed in discussions about both the nature and utility of data in humanities scholarship. The first, better known term, is Big Data. As its name implies, Big Data is characterized by its volume, but also by other factors that will be discussed below. However, although the term Big Data has been in use for many years, a precise definition remains elusive; it is still characterized descriptively, by features such as volume and heterogeneity. The other term, less frequently used, is smart data. Smart data, as a term, is less well-defined than Big Data, and there is no overarching consensus as to its connotations. Whereas Big Data refers to data characteristics, smart data generally refers to how the data are used, particularly in the context of computational processing.
Although data are representations, they are useful primarily because they can be manipulated computationally. Their digital representation in binary code as 0s and 1s makes such processing and manipulation possible. Digitization leads to two possible paths. The first is to curate, annotate, and, most importantly, structure the data into “smart data”. This semi-structured organization is what makes the data “smart”. This is the case with critical digital editions of text (the same could be said of music scores). The second path is to accumulate a large amount of digitized data, where structure and annotation are secondary. Algorithms and machine learning are needed here to make sense of the data. This is the route of “Big Data”, where the raw, unprocessed data are computationally processed and analyzed.
As mentioned above, Big Data is usually defined by delineating its three most characteristic features: volume, variety, and velocity. Velocity in this context refers to the constantly arriving “stream” of data from scientific instruments, sensors, surveillance systems, or social media posts. However, this feature has limited utility for the humanities: it is generally not that important whether data arrive very rapidly or less so. The first two features are more relevant. Humanities data are generally quite voluminous. Large corpora of hundreds or thousands of texts make many demands on storage capacity and pose processing and analysis challenges. Additionally, because texts are studied extensively in various scholarly undertakings, data are usually unstructured, or, at best, semi-structured. Big Data, because of its heterogeneity – that is, because a dataset contains data in different formats and from varied sources – is not conducive to being represented in the standard table format that is the mainstay of relational databases and of the Structured Query Language (SQL) used to query them.
NoSQL, which represents data in key-value pairs instead of in a 2D table, Hadoop technology for storage and retrieval, and MapReduce for parallel processing of blocks of data, all facilitate Big Data processing of unstructured or semi-structured data. New database paradigms based on graphs for multi-relational data are also actively being studied in the context of Big Data.
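The MapReduce pattern can be illustrated with the classic word-count example: a map step emits (word, 1) pairs, a shuffle step groups the pairs by key, and a reduce step sums each group. This is a single-process sketch of the idea; real MapReduce frameworks distribute these steps across many machines, and the two sample documents are invented.

```python
from collections import defaultdict

documents = [
    "big data in the humanities",
    "smart data in the humanities",
]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts in each group.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["data"])        # 2
print(word_counts["humanities"])  # 2
```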
BIG SMART DATA – SMART BIG DATA
Big Data is characterized by a lack of structure, a very large volume (hence the name “big” data), and heterogeneity, or variety in its representations. Smart data, in contrast, are semi-structured, or sometimes structured (recall, for instance, that XML can be considered semi-structured data), smaller in volume, and more uniform in their representations. Smart data also integrates different types of data, including Big Data, and is more conducive to computational analysis than the latter.
Smart data makes sense out of data and, in the case of Big Data, transforms it into actionable elements to support decision-making processes (Zeng, 2017). Compared to Big Data, smart data is relatively “clean”, in the sense that the noise inherent in the acquisition or collection process, or resulting from human error, has been removed or reduced. The main purpose of smart data is to enable important new insights from data, wherever it lies on the “Big Data” to “small data” spectrum. The “value” (one of the new “v’s” of Big Data) comes from its transformation into smart data (Zeng, 2017). This transformation is accomplished through a variety of approaches, including machine learning, deep learning and deep neural networks, predictive analytics, semantic technologies, text analysis, natural language processing, advanced techniques from graph theory, a plethora of techniques from data science and data analytics, and human common sense (Zeng, 2017).
Semantic technology, just mentioned, is used so that computational processing can be performed on data that is not in itself conducive to such processing; that is, it is used to encode semantics into originally semantics-less data. The Resource Description Framework (RDF), a W3C (World Wide Web Consortium) specification, is an example of such a technology. The purpose of RDF was originally to standardize metadata, which, as “data about the data”, is a straightforward approach for adding semantics to data. Its application now includes modeling information found in web resources and knowledge management (methods for creating, using, and managing information within a specific domain). One of the advantages of RDF is that it enables inferences to be drawn, and complex queries to be performed on data. RDF is foundational for linked data, which are collections of inter-related Web datasets in which the relationships between them are available. These relationships among data, as well as access to the data, are necessary for realizing the goal of the semantic web and integration and reasoning on the vast amount of Web data (see the W3C entry on Linked Data). Consequently, RDF provides the standard format for linked data.
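RDF expresses statements as subject-predicate-object triples. The sketch below models a few such triples in plain Python and runs a simple pattern query; in practice a library implementing the W3C specification (such as rdflib) would be used, and the triples themselves are invented examples.

```python
# RDF statements as (subject, predicate, object) triples.
triples = [
    ("Voltaire", "wrote", "Candide"),
    ("Candide", "publishedIn", "1759"),
    ("Voltaire", "bornIn", "Paris"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything stated about Voltaire:
print(query(s="Voltaire"))
# [('Voltaire', 'wrote', 'Candide'), ('Voltaire', 'bornIn', 'Paris')]
```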
Knowledge graphs, or semantic networks, are also useful for transforming Big Data into smart data. These graphs represent entities from the real world in a graph topology. Their main purpose is to integrate data by representing interconnections between entities, which include real-world persons, objects, events, or abstract concepts. Like RDF, knowledge graphs are associated with efforts to realize linked data. They are also employed in search engine technologies.
Knowledge graphs are data structures that represent entities and the semantic relationships existing between these entities. Because of their digital representation, they are useful in resource discovery and retrieval, as well as for navigating and visualizing large amounts of interrelated data. Consequently, they have wide applicability in many domains, and particularly in libraries and in the humanities, where knowledge organization systems – classification schemes, taxonomies, thesauri, glossaries, or other types of vocabularies – have traditionally been used in the pre-digital era, due to the need for authoritative data on people (prosopographies) and places (gazetteers). With advances in digital technologies and computational approaches, and particularly the emergence of the World Wide Web and linked data, metadata associated with knowledge organization systems is becoming increasingly interconnected. As the Web is a decentralized system, this emerging “knowledge network” benefits from community participation in the curation and editorial processes, as is the case with Wikipedia. One of the goals of these interconnected knowledge graphs is to formalize and to connect insights gained from the analysis of large corpora that are of interest in humanities scholarship (Haslhofer et al., 2018). An example of such a project is the Pleiades Gazetteer of the Ancient World.
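A knowledge graph of this kind can be sketched as labelled edges between entities, with a breadth-first traversal used for navigation. The places and relations below are invented for illustration; a real gazetteer project like Pleiades holds curated, authoritative data.

```python
from collections import deque

# Entities connected by labelled semantic relations.
edges = [
    ("Athens", "locatedIn", "Attica"),
    ("Attica", "partOf", "Greece"),
    ("Parthenon", "locatedIn", "Athens"),
]

# Adjacency list (ignoring edge labels) for navigation.
adjacent = {}
for src, _label, dst in edges:
    adjacent.setdefault(src, []).append(dst)

def reachable(start):
    """All entities reachable from `start` by following relations."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacent.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable("Parthenon")))  # ['Athens', 'Attica', 'Greece']
```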
For knowledge graphs and linked data to achieve their potential, new computational techniques and improvements to existing techniques are needed for visualizing, annotating, and analyzing large digital corpora and the parts of these collections that are of specific interest to scholars. For facilitating scholarship, users may apply relevance weights or select concept definitions in knowledge graphs. As in many areas of the digital humanities, enhanced text-mining and machine-learning techniques that scale to a large amount of data or large-scale corpora are needed for analyzing and comparing concept attributes and relationships expressed in knowledge graphs both within and across a range of corpora that can possibly span long time periods. The latter point is important because syntax and semantics in language change over time. An important enhancement to text mining algorithms is adapting to scholars’ annotations and enabling a thorough investigation of semantic relationships extracted from large corpora. Consequently, a focus on developing scalable tools and techniques is needed to enable aligning large scale (possibly multimedia) corpora with concepts expressed in knowledge graphs (Haslhofer et al., 2018). These improvements in the utility of knowledge graphs underscore the importance of interdisciplinary collaboration between scholars in the computational and humanities domains, with crucial contributions from library science. Particularly important for the humanities domains, these innovations in computational processing of knowledge graphs will also advance mixed qualitative and quantitative methods for analyzing large-scale digitized corpora (Haslhofer et al., 2018).
TEI is a prototypical example of transforming data into smart data. Emerging from the same set of concerns that drove the development of XML, it is semi-structured. It follows a schema enforced by tags, but the schema is flexible. It delineates every part of a text, including line and page breaks. The structure in TEI documents makes them conducive to visualization and analysis. For example, Schöch (2013) described a study of literary description in the 18th-century novel, with the goal of identifying all descriptive passages in a corpus of novels published between 1760 and 1800. These passages were then subjected to further analysis of literary stylistics. A database of descriptive passages was constructed, using a bibliographic reference management system as the front end. The passages were tagged with features that were deemed relevant for the research. There were 1,500 entries of tagged descriptive writing. This tagging, whereby data was transformed into “smart data”, enabled the discovery of usage patterns, trends, and correlations that were previously undetected (Schöch, 2013).
However, there is a drawback to smart data. Because it heavily relies on human intervention, it is not scalable. Although the TEI encoding can be partially automated, and with evolving machine learning techniques, more automation in the future is likely, putting text into context and describing it with tags and annotations must be performed by humans, as algorithmic processes that would enable its automation have not yet been discovered. Therefore, creating smart data is a time-consuming task, precluding the generation of large quantities of such data (Schöch, 2013). Consequently, the question becomes whether there is still a place for Big Data in the humanities. Given that Big Data do scale well and do not require a high level of human intervention for processing, humanities scholars are considering the relative merits of Big Data vis-a-vis “Smart Data”.
In the context of Big Data and smart data, there are two different but related “shifts” that can be observed. The shift from print media to tagged, semi-structured smart data is certainly a methodological shift. However, the move from smart data to Big Data involves much more fundamental changes, akin to the shift from close reading to distant reading. Analysis shifts from semantics, interpretation, and qualitative metrics to statistical and quantitative values. Entire bodies of literary works for specific genres, time periods, or cultures can be studied, and, consequently, questions of what is “representative” of that body of work, or of the literary quality of individual works, become less relevant. Consequently, the defining characteristic of Big Data in the humanities is not primarily a technological shift, but rather a methodological one (Schöch, 2013).
In terms of computational tools for Big Data, cluster analysis, unsupervised machine learning, and principal component analysis (PCA) are among the most popular. PCA is a dimension reduction technique, and is therefore useful for simplifying large, complex, high-dimensional data sets. An important factor that should be considered is scale. Because of the large data volume and computational constraints, Big Data is often analyzed at a low resolution. Consequently, outliers are generally implicitly smoothed out, which is particularly problematic for humanities research, where outliers, far from being considered “noise”, may lead researchers to some of the most interesting and insightful findings. This problem underscores the importance of visualization in Big Data analysis, as it allows data to be analyzed at different resolutions. In this way, both overall trends and outliers can be investigated. Visualization of Big Data is also valuable for indicating directions that should be taken for subsequent analysis. The further development of Big Data techniques is important and necessary because many of the techniques for analyzing Big Data can also be used for “small” or “mid-sized” data, which are also major components of humanities scholarship.
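For two-dimensional data, the PCA computation mentioned above can be sketched with the standard library alone: centre the data, form the 2x2 covariance matrix, and take the eigenvector of its largest eigenvalue as the first principal component. The sample points are invented; real analyses would use a library such as scikit-learn and higher-dimensional data.

```python
import math

# Invented 2D sample points lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
n = len(points)

# Centre the data on its mean.
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centred = [(x - mx, y - my) for x, y in points]

# 2x2 sample covariance matrix [[a, b], [b, c]].
a = sum(x * x for x, _ in centred) / (n - 1)
b = sum(x * y for x, y in centred) / (n - 1)
c = sum(y * y for _, y in centred) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix, via the quadratic formula.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

# Its eigenvector: (b, lam - a) solves (A - lam*I)v = 0 when b != 0.
pc1 = (b, lam - a)
norm = math.hypot(*pc1)
pc1 = (pc1[0] / norm, pc1[1] / norm)

# The first principal component points along the main direction of
# spread, here close to the 45-degree diagonal.
print(pc1)
```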
With smart data, Big Data analysis may become more conducive to humanities scholarship: because smart data are explicitly structured and annotated, application-specific techniques can be applied more directly, and ambiguities – which are not simply outliers – can be treated explicitly.
CONCLUSION
Big Data and smart data should be seen as complementary, not competitive, paradigms. Two factors that may contribute to the convergence of Big Data and smart data are automatic annotation, making Big Data “smarter”, and crowdsourcing. Crowdsourcing is a form of human “parallel computation”. Large tasks are decomposed into smaller ones that are distributed and subsequently undertaken by volunteers, and approaches such as gamification make it worth their while. Research conducted with crowdsourcing is particularly relevant for improving the quality of optical character recognition (OCR), especially for pre-1800 writing (Schöch, 2013).
Both Big Data and smart data are needed. Big Data ensures that problems of underrepresentation are reduced and improves the robustness of statistical results that are calculated from the data. That is, statistics are less reliable and robust when small sample sizes are used. However, for these large collections, rich metadata and annotations are needed for the analyses relevant to humanities scholarship. Consequently, there is a need for “bigger smart data” as well as “smarter big data” (Schöch, 2013).
Christof Schöch, Professor of Digital Humanities at the University of Trier (Germany), summarizes: “…we need smart big data because it can not only adequately represent a sufficient number of relevant features of humanistic objects of inquiry to enable the level of precision and nuance scholars in the humanities need, but it can also provide us with a sufficient amount of data to enable quantitative methods of inquiry that help us transgress the limitations inherent in methods based on close reading strategies. To put it in a nutshell: only smart big data enables intelligent quantitative methods” (Schöch, 2013) (11).