What Is (Are) Data?
Introduction
Digital humanities scholar Lev Manovich develops a concept of data that focuses on the interface between computation and the humanities. Manovich describes data as that which can be read, transformed, and analyzed computationally, and argues that this way of viewing data “… imposes fundamental constraints on how we represent anything” (Manovich, in Paul, 2019).
At a basic level, data can be defined as follows: “A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen” (Consultative Committee for Space Data Systems 2002, 1-9, cited in Borgman, 2009).
Data can also be defined by where and how they originated. A 2005 United States National Science Foundation policy report (Long-Lived Data Collections) identified four widely accepted categories of data:
- Observational data include measurements (weather measurements are a simple example) that are commonly associated with specific places and times, or with multiple places and times (as in longitudinal studies).
- Computational data are not directly measured but result from a computer model or simulation, including virtual reality. Model replication and verification are a major issue with computational data and require extensive documentation of the algorithms, the software, and, increasingly, the input data used in the model and the output results it produces.
- Experimental data include the results of laboratory experiments and field experiments.
- Records include those from public and private life.
All of these data sources can be useful for various types of research, including humanities scholarship (Borgman, 2009).
However, even something as seemingly objective as “data” has a more subtle, subjective dimension: what counts as “data” (or signal) from one point of view may be considered “noise” from another. Data can therefore be regarded as “alleged evidence”, and are heavily context dependent. To use an example from Borgman, environmental scientists may argue over views of data as fundamental as temperature, whereas an engineer may state that “temperature is temperature”. Biologists have even more complex (and nuanced) descriptions of temperature that take into account how this quantity is measured (Borgman, 2009).
Given these issues, humanities scholars must be cognizant of the assumptions they make about their data, their sources, and their epistemology. As Borgman states: “[w]e are only beginning to understand what constitute data in the humanities, let alone how data differ from scholar to scholar and from author to reader.” She then cites a 2009 personal communication with philosopher Allen Renear: “In the humanities, one person’s data is another’s theory” (Borgman, 2009).
Measuring the Size of Data
The size of data in digital contexts is expressed using metric system prefixes.
1 byte = 8 bits
1 kilobyte (KB) = 2^10 bytes = 1024 bytes ≈ 1000 bytes
1 megabyte (MB) = 2^20 bytes = 1,048,576 bytes ≈ 1 million (1,000,000) bytes
1 gigabyte (GB) = 2^30 bytes = 1,073,741,824 bytes ≈ 1 billion (1,000,000,000) bytes
1 terabyte (TB) = 2^40 bytes ≈ 1 trillion (1,000,000,000,000) bytes
1 petabyte (PB) = 2^50 bytes ≈ 1 quadrillion (10^15) bytes
1 exabyte (EB) = 2^60 bytes ≈ 1 quintillion (10^18) bytes
1 zettabyte (ZB) = 2^70 bytes ≈ 1 sextillion (10^21) bytes
1 yottabyte (YB) = 2^80 bytes ≈ 1 septillion (10^24) bytes
Note that one kilobyte is not exactly 1000 bytes, one megabyte is not exactly 1,000,000 bytes, and so on. Under the convention used here, the number of bytes in a kilobyte, megabyte, gigabyte, etc. is instead exactly a power of two.
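The arithmetic behind these units can be checked with a short Python sketch. It assumes the power-of-two convention described in this section, and the function names (binary_size, decimal_size, relative_gap) are invented here for illustration only.

```python
# Prefixes in increasing order; position 1 is "kilo", position 8 is "yotta".
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def binary_size(index: int) -> int:
    """Bytes in the unit at this position under the power-of-two convention."""
    return 2 ** (10 * index)


def decimal_size(index: int) -> int:
    """The round power of ten that the same prefix approximates."""
    return 10 ** (3 * index)


def relative_gap(index: int) -> float:
    """How far the power of two exceeds the power of ten, as a fraction."""
    return binary_size(index) / decimal_size(index) - 1


for i, name in enumerate(PREFIXES, start=1):
    print(
        f"1 {name}byte = 2^{10 * i} = {binary_size(i):,} bytes "
        f"({relative_gap(i):.1%} more than 10^{3 * i})"
    )
```

Running the loop shows that the approximation becomes looser as the units grow: a kilobyte is about 2.4% larger than 1000 bytes, while a yottabyte is about 20.9% larger than 10^24 bytes.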
What Is Data? A Formula
The importance of data, databases, and the incorporation of data into digital humanities workflows has gained attention recently. Lev Manovich, a leading digital humanities theorist and a pioneer at the interface between data science and the digital humanities, treats a database approach as a crucial component of new media. It is increasingly clear that the digital humanities field is data-driven and depends on new media content sources such as data archives and textual and audio-visual sources. Consequently, data-centric workflows are required to expand and improve the techniques employed in humanities research (Lugmayr & Teras, 2015).
Manovich offers the formula Data = Objects + Features: “Together, a set of objects and their features constitutes the ‘data’ (or ‘dataset’) that we can work with using computers”, and “… ‘data’ is something a computer can read, transform, and analyze” (Manovich, in Paul, 2019).
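One way to picture the formula is as a small table with one row per object and one column per feature. The Python sketch below is only an illustration of that idea, not Manovich's own implementation: the DataObject class, the build_dataset function, and the two example poems with their feature values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    """One object in the dataset, described by named features (hypothetical)."""
    identifier: str
    features: dict


def build_dataset(objects):
    """Arrange objects and features as a table: one row per object."""
    # Collect every feature name that appears on any object.
    columns = sorted({name for obj in objects for name in obj.features})
    # A feature that an object lacks is recorded as None.
    rows = [[obj.features.get(col) for col in columns] for obj in objects]
    return columns, rows


# Invented example objects; the values are illustrative, not real measurements.
objects = [
    DataObject("poem_a", {"word_count": 120, "year": 1850}),
    DataObject("poem_b", {"word_count": 95, "year": 1862}),
]

# "Read" and "transform": turn the objects into a feature table.
columns, rows = build_dataset(objects)

# "Analyze": compute a simple summary over one feature.
mean_word_count = sum(o.features["word_count"] for o in objects) / len(objects)

print(columns)
print(rows)
print(mean_word_count)
```

The choice of features is the part a scholar decides. The same two poems could equally be described by rhyme scheme or place of publication, which would produce a different dataset from the same objects.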