What Is (Are) Data?

Mark P. Wachowiak

Introduction

Digital humanities scholar Lev Manovich develops a concept of data that focuses on the interface between computation and the humanities. Manovich describes data as that which can be read, transformed, and analyzed computationally, and argues that this way of viewing data “… imposes fundamental constraints on how we represent anything” (Manovich, in Paul, 2019).

At a basic level, data can be defined as follows: “A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen” (Consultative Committee for Space Data Systems 2002, 1-9, cited in Borgman, 2009).
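The idea of a "reinterpretable representation" can be illustrated with a minimal Python sketch. The four byte values below are an arbitrary choice made for this illustration, not an example from the cited sources; the point is only that the same stored bits can be read as bits, as numbers, as characters, or as a single integer, depending on the interpretation applied.

```python
# Four arbitrary bytes chosen for illustration (an assumption, not from the source).
raw = bytes([72, 105, 33, 0])

# Interpretation 1: a sequence of bits.
as_bits = " ".join(f"{b:08b}" for b in raw)

# Interpretation 2: a small table of numbers (one per byte).
as_numbers = list(raw)

# Interpretation 3: characters on a page (first three bytes read as ASCII).
as_text = raw[:3].decode("ascii")

# Interpretation 4: one unsigned integer, least significant byte first.
as_integer = int.from_bytes(raw, byteorder="little")

print(as_bits)
print(as_numbers)
print(as_text)
print(as_integer)
```

Nothing in the bytes themselves says which reading is correct; that choice comes from the context in which the data are used.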

Data can also be defined by where and how it originated. A 2005 United States National Science Foundation policy report (Long-Lived Data Collections) identified four widely accepted categories of data:

Observational data include measurements (weather readings are a simple example) that are commonly associated with specific places and times, or with multiple places and times (as in longitudinal studies).
Computational data are not measured directly but result from a computer model or simulation, including virtual reality. Replicating and verifying the model is a major issue with computational data and requires extensive documentation of the algorithms, software, and, increasingly, the input data used in the model and the output it produces.
Experimental data are well understood and include results from laboratory or field experiments.
Records include those from public and private life.

All these data sources can be useful for various types of research, including humanities scholarship (Borgman, 2009).

However, even something as seemingly objective as “data” has a more subtle, subjective dimension: what counts as “data” (or signal) from one point of view may be considered “noise” from another. Data can therefore be regarded as “alleged evidence” and are heavily context dependent. To use an example from Borgman, environmental scientists may argue over how to treat data as fundamental as temperature, whereas an engineer may hold that “temperature is temperature”. Biologists have even more complex (and nuanced) descriptions of temperature that take into account how this quantity is measured (Borgman, 2009).

Given these issues, humanities scholars must be cognizant of the assumptions they make about their data, as well as of their sources and epistemology. As Borgman states: “[w]e are only beginning to understand what constitute data in the humanities, let alone how data differ from scholar to scholar and from author to reader.” She then cites a 2009 personal communication with philosopher Allen Renear: “In the humanities, one person’s data is another’s theory” (Borgman, 2009).

Measuring the Size of Data

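The size of data in digital contexts is expressed with metric system prefixes, although each unit in the list that follows is defined as a power of two rather than a power of ten.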

1 byte = 8 bits

1 kilobyte (KB) = 2¹⁰ bytes = 1024 bytes ≈ 1000 bytes

1 megabyte (MB) = 2²⁰ bytes = 1,048,576 bytes ≈ 1 million (1,000,000) bytes

1 gigabyte (GB) = 2³⁰ bytes = 1,073,741,824 bytes ≈ 1 billion (1,000,000,000) bytes

1 terabyte (TB) = 2⁴⁰ bytes ≈ 1 trillion (1,000,000,000,000) bytes

1 petabyte (PB) = 2⁵⁰ bytes ≈ 1 quadrillion (10¹⁵) bytes

1 exabyte (EB) = 2⁶⁰ bytes ≈ 1 quintillion (10¹⁸) bytes

1 zettabyte (ZB) = 2⁷⁰ bytes ≈ 1 sextillion (10²¹) bytes

1 yottabyte (YB) = 2⁸⁰ bytes ≈ 1 septillion (10²⁴) bytes

Note that one kilobyte is not exactly 1000 bytes, one megabyte is not exactly 1,000,000 bytes, and so on. Under the definitions above, the number of bytes in a kilobyte, megabyte, gigabyte, etc. is instead exactly a power of two.
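The difference between the power-of-two sizes and their power-of-ten approximations can be checked with a short Python sketch. The function names (`binary_bytes`, `decimal_bytes`, `percent_gap`) are illustrative choices for this example; the sketch assumes the definitions listed above, in which each prefix multiplies the previous unit by 2¹⁰.

```python
# Prefixes in the order listed above: step 1 = kilo, step 2 = mega, and so on.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def binary_bytes(step: int) -> int:
    """Exact size under the power-of-two definition: 2^(10 * step) bytes."""
    return 2 ** (10 * step)


def decimal_bytes(step: int) -> int:
    """Round-number approximation: 10^(3 * step) bytes."""
    return 10 ** (3 * step)


def percent_gap(step: int) -> float:
    """How much larger the exact size is than the approximation, in percent."""
    exact = binary_bytes(step)
    approx = decimal_bytes(step)
    return 100 * (exact - approx) / approx


for step, prefix in enumerate(PREFIXES, start=1):
    print(
        f"1 {prefix}byte = 2^{10 * step} = {binary_bytes(step):,} bytes "
        f"({percent_gap(step):.1f}% more than 10^{3 * step})"
    )
```

The gap is 2.4% for a kilobyte and widens with each prefix, exceeding 20% for a yottabyte, so the "≈" in the list becomes a looser approximation as the units grow.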

What Is Data? A Formula

The importance of data, databases, and the incorporation of data into workflows in the digital humanities has gained attention recently. Lev Manovich, one of the leading digital humanities theorists and a pioneer at the interface between data science and the digital humanities, treats the database as a crucial component of new media. It is increasingly clear that the digital humanities field is data-driven and depends on new media content sources such as data archives and textual and audio-visual materials. Consequently, data-centric workflows are required to expand and improve the techniques employed in humanities research (Lugmayr & Teras, 2015).

Manovich proposes the formula Data = Objects + Features: “Together, a set of objects and their features constitutes the ‘data’ (or ‘dataset’) that we can work with using computers”, and “… ‘data’ is something a computer can read, transform, and analyze” (Manovich, in Paul, 2019).
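One way to picture the formula is as a small table built in Python. This is only a sketch: the object identifiers, feature names, and values are hypothetical, invented for illustration, and do not come from Manovich. Each object becomes one record and each feature becomes one field of that record.

```python
# Hypothetical objects: three digitized artworks with made-up identifiers.
objects = ["artwork_001", "artwork_002", "artwork_003"]

# Hypothetical features: one value per object, in the same order as `objects`.
features = {
    "year": [1889, 1907, 1931],
    "medium": ["oil", "oil", "photograph"],
    "width_cm": [92.1, 233.7, 24.0],
}

# Data = Objects + Features: one record per object, one field per feature.
dataset = [
    {"object": obj, **{name: values[i] for name, values in features.items()}}
    for i, obj in enumerate(objects)
]

# Once in this form, the data can be read, transformed, and analyzed.
for record in dataset:
    print(record)

oil_works = [r["object"] for r in dataset if r["medium"] == "oil"]
earliest = min(dataset, key=lambda r: r["year"])["object"]
print(oil_works, earliest)
```

Which features are recorded is itself a decision: an attribute left out of `features` is invisible to any later computation, which is one concrete sense in which this representation constrains what can be analyzed.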

[Work Cited]

License

Creative Commons Attribution-ShareAlike 4.0 International License