Data Science and Big Data
The amount of data being acquired, processed, and analyzed in humanities scholarship continues to grow at a rapid pace. Digital archives, born-digital resources, images and 3D models for cultural heritage, and a vast amount of digitized textual material contribute to immense volumes of data that are challenging to process and analyze. These data are often non-numeric, complex, and heterogeneous; that is, they may originate from multiple sources and in multiple formats, and they must be combined in flexible ways to derive the most benefit from them (Schöch, 2013).
Consequently, methods from “Big Data” analysis are being used. Big data are “relatively unstructured, messy and implicit, relatively large in volume, and varied in form” (Schöch, 2013). Big data are processed and analyzed with a variety of methods. Machine learning techniques that scale to large volumes of data are frequently employed. MapReduce is a parallel computing model that decomposes large data sets into smaller units, which are processed in parallel on distributed server systems (e.g., running Hadoop) and whose partial results are then combined. Big data are also stored and queried with NoSQL databases, as described above.
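The MapReduce model can be illustrated with a minimal word-count sketch in Python. This is only an illustration under simplifying assumptions, not how Hadoop itself is implemented: the corpus is a made-up list of short text chunks, the function names (map_chunk, merge_counts, word_count) are hypothetical, a local thread pool stands in for a cluster of servers, and the grouping of keys that a full MapReduce system performs between the two steps is folded into the reduce step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(chunks, workers=4):
    """Count words across all chunks, running the map step in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical corpus, already decomposed into smaller units.
chunks = [
    "the archive holds letters",
    "the letters describe the voyage",
    "the voyage ended",
]
totals = word_count(chunks)
print(totals.most_common(3))
```

Because each chunk is counted independently, the map step can be distributed across as many workers as there are chunks; only the comparatively small partial counts need to be brought together in the reduce step.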
Data science and data analytics are relatively new terms that describe the intersection of statistics, computation, and domain areas, with the aim of acquiring new knowledge from the large amounts of heterogeneous data emerging from many disciplines, including the digital humanities. Data science “is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments” (Cao, 2017). These data are subsequently studied, and a multi- and interdisciplinary approach is taken to “transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology” (Cao, 2017).
Data analytics is distinguished from standard data analysis in that data analytics explicitly addresses computational methods, tools, theoretical constructs, and technologies that provide an “in-depth understanding and discovery of actionable insight into data” (Cao, 2017). Data analytics comprises three subfields. Descriptive analytics obtains useful information from data through traditional descriptive statistical methods. Predictive analytics makes predictions about unknown future events, usually based on historical data and on descriptive statistics calculated from these data; advanced statistical techniques and machine learning algorithms are used for this purpose. Prescriptive analytics optimizes the decision-making process and facilitates recommendations and actions for “smart decision-making” (Cao, 2017).
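The difference between descriptive and predictive analytics can be made concrete with a small Python sketch. The figures are invented for illustration (a hypothetical yearly count of digitized items in an archive), and an ordinary least-squares line stands in for the more advanced statistical and machine learning techniques mentioned above; prescriptive analytics is not covered here.

```python
from statistics import mean, stdev

# Hypothetical historical data: items digitized per year (invented values).
years = [2015, 2016, 2017, 2018, 2019]
items = [100, 120, 140, 160, 180]

# Descriptive analytics: summarize what has already been observed.
average_items = mean(items)
spread_items = stdev(items)


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_mean, y_mean = mean(xs), mean(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Predictive analytics: extrapolate from the historical data to a future year.
intercept, slope = fit_line(years, items)
forecast_2020 = intercept + slope * 2020

print(f"mean={average_items}, stdev={spread_items:.1f}, forecast={forecast_2020}")
```

The descriptive step only summarizes the past, whereas the predictive step commits to a model of how the data behave and is therefore only as reliable as that model's assumptions, here a constant yearly increase.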