Working with Data
11 Digital Preservation of Research Data
Grant Hurley and Steve Marks
Learning Outcomes
By the end of this chapter you should be able to:
- Identify threats to the long-term accessibility of digital research data.
- Develop a plan for the preservation of a given dataset in the context of a defined Designated Community (DC) and its expected use case.
- Determine whether possible preservation actions positively contribute to the long-term accessibility of a given dataset.
Introduction
Digital preservation is commonly defined as “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (Digital Preservation Coalition, 2015). Whether these materials were born-digital or digitized from another source, this goal remains the same. Although digital preservation is a relatively new field (at least compared to physical preservation!), the preservation of research data has been a part of its study since the beginning. In fact, one of the formative documents of most modern approaches to digital preservation, the Open Archival Information System (OAIS) model, was developed by a consortium of space agencies to help deal with the problem of access to historical space mission data.
The goal of this chapter is to introduce some of the basic concepts of digital preservation, with a focus on practical approaches to common problems and solutions that you may be faced with as you look at preserving research data for the long term.
Threats to Objects Over Time
Maybe one of the easiest ways to understand the risks to digital objects (including research data) over time is to put ourselves in a scenario. Imagine that you’ve come across a stack of old 5.25-inch floppy disks that you believe contain some sort of useful data: research logs from a predecessor of yours or historical data from your field of study or anything else you can imagine. It doesn’t matter — the only thing that matters is you want what’s on those disks!
However, the drives that read this type of disk are no longer standard issue with computers. In fact, they can be difficult to find in working condition. This illustrates our first threat: media obsolescence. Our storage media — in this case the floppy disks — require certain configurations of hardware and software in order to be read. When the necessary hardware is no longer available (or is difficult to obtain), the media can no longer be used and are said to be obsolete.
For the purposes of this chapter, let’s assume that we were lucky and able to get our hands on a working 5.25-inch floppy drive. We put our first disk in the drive, double-click it in Finder or Windows Explorer, and … what? Why is it saying the disk contains no data? It could be a couple of things. Maybe the disk indeed contains no data, or maybe we’ve fallen prey to a second threat: media degradation — that is, the “decay” of the media and its contained information over time. Most types of digital media have a limited shelf life and, once they’re gone, it can be difficult or impossible to recover the data.
However, maybe the data are still there but we’re not able to read them. They were probably written on an older computer, and it’s possible that the originating system wrote the data to the disk in a way that is different from what our modern computers expect. Without software to help our modern computer read the disk, we may not be able to determine what files exist, what they are named, or where one file ends and another begins. These are all functions of a data structure called the file system.
But let’s assume that we’re able to browse the file system of the disk, either because it was written in a way that our computer understands or because we installed something that helped us do that. We could run into another problem: the files themselves may not be intelligible to the applications we use in our day-to-day computing environment. Perhaps the files were created using an old database program or were encoded in some format that was intended to be accessed only with a proprietary viewer program — one that is no longer available. This and the preceding file system problem are examples of format obsolescence.
Finally, if we are able to access the disk and the files it contains, read files off the disk, and understand how those files are decoded, we may still be missing crucial information about the data. If they are observational data, we may be missing information about when and where and how they were gathered. If they’re image data, we may be missing information about what the images depict. For any data, we may be missing information about who created them and whether there are outstanding intellectual property restrictions on the data. Depending on our use case, we may not care about these questions, but if we’re interested in rigorous academic work, we probably do care, and this loss of provenance is the final problem we can identify in this scenario.
Worried yet? The good news is that we are not the first people to encounter these problems. In fact, there’s an entire field of digital preservation dedicated to identifying, avoiding, and rectifying many of these problems. Before we talk about how to address these problems, let’s look at some of the basics.
The Goals of Digital Preservation
According to the Digital Preservation Coalition (DPC) (2015), digital preservation is defined as “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” Let’s walk through the components of this definition to explain the broad goals of digital preservation.
We’ll begin with “digital materials” since these are the subject of digital preservation activities. What are digital materials? The word materials suggests a physical form, and digital materials always have a physical instantiation somewhere, whether they are stored on a 5.25-inch floppy disk, a server, an external hard drive, a USB flash drive, or a CD. Each of these storage methods encodes information in some manner, whether through magnetic fluctuations (servers, floppy disks, and many external hard drives), charged cells (flash drives), or pits (CDs). This first layer of mediation is followed by more — considerably more than one usually finds with analogue records. For example, take a textual document like a memorandum. In paper format, there are two immediate levels of mediation: the physical sheet of paper (Is it intact and complete? Or is it damaged?) and the text written on it (Is it visible or faded? What language is it written in?). An equivalent digital memo in Microsoft Word’s DOCX format must first be retrieved from a storage medium as a series of bytes, which, when grouped together, make a bitstream with a discrete beginning and end. Usually, more than one bitstream is required to compose an individual file. This is the case with the DOCX format, which is made up of a number of XML text files and folders grouped together into a ZIP package. It’s easy to forget that we call a digital file a file because it is composed of a series of smaller pieces of information, just as a paper file contains individual documents. In other cases, multiple individual files may be accessed independently but need to be run together for the intended output, such as scripts used to process input data; or a collection of text files in HTML, CSS, and JavaScript format plus images and PDFs that together make up a website. At the simpler end of the spectrum, a single bitstream makes up the entirety of one plain text file.
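To make this layering concrete, the short sketch below (Python, standard library only; the filename is hypothetical) opens a DOCX file as the ZIP package it actually is and lists the bitstreams bundled inside it.

```python
# A minimal sketch: a DOCX "file" is really a ZIP package containing many
# bitstreams (XML files, folders, images). The filename below is hypothetical.
import zipfile

with zipfile.ZipFile("memo.docx") as package:
    for member in package.namelist():
        # Typical entries include [Content_Types].xml and word/document.xml
        print(member)
```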
In either case, the bitstreams must then be interpreted according to a particular structure: the file format. A file format is “a convention that establishes the rules for how information is structured and stored in a file” (Owens, 2018, p. 47). File formats link bitstreams and file systems with software. Given a particular file format, operating systems then enable the installation of particular pieces of software to read, interact with, and save files in that format. File formats also have the advantage of supporting exchange — since each file in a particular format is structured in the same way, it’s understandable to different applications or systems that wish to open a file in that format. But a file format is a human construction: “all conversations about formats need to start from the understanding that they are conventions for how files are supposed to be structured, not essential truths” (Owens, 2018, p. 120). Some file formats, especially those tied to one piece of software, are not accessible without that software in place and “lock in” users to a particular commercial product. File formats also change over time in step with software and user requirements: software in one version may be incompatible with a file format in an older version. Specialized software (used in research fields like health sciences, social science, or biology), even if not commercially sold, may nevertheless use unique file formats or run on different versions of software that are not well documented or supported.
Software requires a physical computer to run on, composed of hardware pieces such as memory, processors, and storage space. An operating system (OS), such as Windows, macOS, or Linux, is a piece of software that controls all of those components, plus additional ones like input devices (keyboard, mouse), output devices (display, printer), storage, and networking. Operating systems also control access to the computer’s file system, which determines the rules for how and where data are stored and retrieved from a storage medium. Due to the specific implementations of each OS, certain software may run only on specific OSs or be limited to specific versions of one.
Next, let’s look at the idea of “continued access,” which is affected by the level of openness, such as whether materials are available for free use online, by request, or restricted to particular individuals or community members based on cost, privacy, copyright, or other restrictions. Continued access can be threatened by issues such as loss due to a subscription cancellation or a service provider that has gone out of business. As such, digital preservationists need to maintain information about provenance and rights to access digital objects over time. The de facto standard for this information is the PREMIS metadata standard maintained by the Library of Congress, which provides a framework for recording detailed information about the actions conducted to maintain digital materials over time.
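As an illustration of the kind of information PREMIS is designed to capture, the sketch below records a single preservation event (here, a fixity check) as a simple Python dictionary. The field names are simplified stand-ins rather than the official PREMIS elements, and the file and agent names are hypothetical.

```python
# Illustrative only: a simplified record of one preservation event, showing the
# sort of provenance information PREMIS captures. Field names are stand-ins for
# the official PREMIS elements; the file and agent names are hypothetical.
from datetime import datetime, timezone

fixity_event = {
    "event_type": "fixity check",
    "event_date_time": datetime.now(timezone.utc).isoformat(),
    "event_outcome": "pass",                                  # checksums matched
    "linked_object": "dataset_001/survey_responses.csv",      # the file checked
    "linked_agent": "preservation-workflow-script v1.2",      # the software that ran the check
}

print(fixity_event)
```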
Finally, the DPC definition acknowledges that not all digital materials will be maintained forever: “for as long as necessary” is more realistic. Some materials have immediate value, but that value may fade over time; other materials must be deleted as governed by privacy legislation or rules for conducting ethical research. In an ideal world, the digital preservationist hands off their maintenance work to others to continue it. This also speaks to the word “managed” in the definition: the work of digital preservation must take place within a structure, institutional or otherwise, that will outlast reliance on particular individuals.
Digital Preservation Versus Curation
If digital preservation is a set of maintenance processes with the goal of maintaining access over time, then a subsequent question arises: given all the human and technical resources required, what should be preserved? The subject of determining preservation priorities — which identifies the materials an organization chooses to put resources into preserving and which it does not — falls into the broader area of digital curation and, specifically, appraisal as part of the curation process. Appraisal, as outlined in Jonathan Dorey, Grant Hurley, and Beth Knazook’s Appraisal Guidance for the Preservation of Research Data, involves the determination of value. In the case of research data, which are typically deposited by a creator with an organization, the question becomes, does this set of files possess adequate future value to merit acquisition and preservation? If your organization has a mission to preserve materials for the long term, then you will need access to the right subject or domain knowledge to make these value judgements. You may also call upon collections development strategies or policies to determine if a candidate dataset is within the scope of your organization’s priorities. In addition, specific digital preservation expertise may be needed to identify whether the materials can be preserved, the types of preservation interventions required, and the resources needed to do the work. This process is a technical appraisal. Once the value of a dataset is established, subsequent curation activities may focus on improving the materials through quality checking, running code, and improving documentation and metadata. You may also need to identify individual files in a dataset that should not be retained or, conversely, missing files that need to be collected. A thorough list of these types of activities is offered by the Data Curation Network’s CURATE(D) workflow and the Dataverse Curation Guide, prepared by the Digital Research Alliance of Canada.
In line with the DPC definition of digital preservation being “for as long as necessary,” the choice to retain a dataset is not permanent: datasets may be revisited through a reappraisal process to ensure they continue to hold value to the organization and its community.
Designated Communities
Given the many possible choices when identifying preservation interventions for a specific set of materials an organization has decided to keep, preservationists may ask how to decide what steps to take. The Open Archival Information System (OAIS) standard contains a useful concept that aids in this work: the idea of a “Designated Community.” In OAIS, this is defined as follows:
An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities. A Designated Community is defined by the Archive and this definition may change over time (CCSDS, 2012, p. 1–11).
Many librarians and archivists have struggled with this concept, since narrowing their activities to a specific group can be seen as conflicting with their professional duty toward broad and accountable public access (Bettivia, 2016, p. 5). Defining a Designated Community does not preclude preserving materials for everyone, but it does force the preserver to consider that community’s needs when making preservation decisions, including the outcomes of preservation interventions, the metadata available to users, and the common set of services enabling access (Marks, 2015, p. 16). This means doing “preservation for someone rather than preservation of something” (Bettivia, 2016, p. 3). Many institutions have implicit Designated Communities, such as faculty members, students, and staff at an academic institution, citizens of a town or territory, or employees of a private organization, even if they possess a broad public mandate. Defining a Designated Community forces these assumptions to be made explicit. Primary, secondary, and tertiary Designated Communities may also be assigned, with decreasing levels of specificity, to capture the widest possible set of members without making impossible promises to preserve all materials on behalf of “the world.”
When doing preservation for identified communities, the information being preserved must remain independently understandable to members of that Designated Community. OAIS defines “independent understandability” as “a characteristic of information that is sufficiently complete to allow it to be interpreted, understood and used by the Designated Community without having to resort to special resources not widely available, including named individuals” (CCSDS, 2012, p. 1–12). This means that materials should be usable by community members without outside help. As the curator, you need to understand what knowledge the members of the Designated Community will have and provide materials that will be accessible to them. In a Research Data Management (RDM) context, it’s common to assume a level of expertise related to the domain or discipline in which the data are produced. For example, a social science data repository would assume that members of its primary user community (social science researchers) are able to use statistical analysis software, so preserving and providing tabular data in raw format for use in R or other software would be sufficient. If the repository desires to be usable by non-experts, it may be necessary to provide other options for access, such as an interactive visual interface for querying tabular data. In this way, at some layers of the preservation and access infrastructure, “there is a commonality of services, and at some point subject-specificity may dictate a need for different approaches to serve different Designated Communities” (Bettivia, 2016, p. 6). At the end of the day, as Nancy McGovern’s (2016) Digital Preservation Management Model Document observes, “A digital archive may be dark, dim, or lit, but the absolute proof of preservation is in the capability to provide meaningful long-term access.” In other words, if the digital materials can’t be used, then they haven’t been usefully preserved.
Significant Properties
Having established the concept of Designated Community, we can now turn to another important concept, one that flows directly out of the Designated Community and their needs: significant properties. The Digital Preservation Coalition (2015) Glossary defines significant properties as “characteristics of digital and intellectual objects that must be preserved over time in order to ensure the continued accessibility, usability and meaning of the objects and their capacity to be accepted as (evidence of) what they purport to be.”
Significant properties are important because they are derived from the specific perspectives and needs of the DC: they are the properties of a given data object that must be retained for the object to meet the DC’s needs. These significant properties will vary depending on the data object and, even within the scope of a single object, can be as diverse as the Designated Communities that may access it. That said, a few significant properties are identified as important in almost all cases.
One of these key significant properties is format. As mentioned above, digital objects often need specific pieces of software in order to be accessed, and that software relies on its ability to interpret how the data are encoded in the file — the file format. Different types of research data, such as tabular data, text documents, images, and audio or video recordings, may utilize different file formats to store information accurately and efficiently.
Another significant property of research data is their metadata, which can include information about the data’s creator, methodology, coverage, and other relevant details. Accurate and comprehensive metadata are essential for understanding the context and meaning of the data as well as for enabling proper citation and attribution. Within the research data realm, these metadata can be quite specialized, just as the data themselves are. For example, historical survey data used to support social science research may be described in the DDI metadata standard, which allows for the robust description of potentially relevant details, such as survey population, sampling methodology, and so on. A dataset gathered as part of an astronomy project will likely have little use for these same fields but will require a host of other ones — perhaps relating to telescope orientation, weather conditions, and others. For more about metadata and for a discussion about the important considerations when selecting long-lived file formats, please see chapter 9, “Insights Into the Fascinating World of File Formats and Metadata.”
In addition to these technical properties, research data may have other significant properties related to their content or context. For example, data may be part of a larger research project or study or may be linked to other datasets or materials. It’s important to consider these relationships and connections when preserving research data to ensure that the data can be understood and used in the context in which they were created. It’s harder to generalize about how these significant properties are stored because it can depend on the context of the researcher or group that gathered the data or the repository in which the data are found. Some of the questions you may want to ask in looking at these properties include but are not limited to the following:
- Is this dataset part of a series?
- Does this dataset have other versions?
- Are these data in support of a specific publication?
- Are these data a subset of a larger dataset?
Although significant properties can be tricky to identify at first, the most important thing to remember is that they are an expression of the needs of the Designated Community. So, when in doubt, consult with a member of the DC or at least think about what aspects of the data are necessary to ensure the data are usable by that community.
Digital Preservation in a Research Data Context
Preservation Actions
This section now turns from conceptual frameworks to the daily practice of digital preservation through the identification, performance, and evaluation of preservation actions.
Four broad categories of commonly performed preservation actions are discussed below:
- Checksums and bit-level preservation establish integrity and a baseline of assurance that materials remain intact and complete over time. Bit-level preservation requires organizations to identify robust strategies for preservation storage and is associated with preventing problems around media obsolescence and media degradation.
- Technical metadata are commonly extracted from individual files or bitstreams, which can help inform the management of the files and bitstreams over time. The file format is the most commonly extracted value for this purpose. These actions help ameliorate risks associated with format obsolescence and loss of provenance.
- File format validation takes inputs from the process of identification and, for certain formats, evaluates whether the file in question meets the basic standards for structure and quality as defined for that format. This process relates to format obsolescence but can also help identify potential media degradation.
- Finally, normalization and migration actions can be taken in order to ensure data are not locked into a forgotten or proprietary format. Again, this speaks to the problem of format obsolescence.
While this list does not include all possible digital preservation activities, these functions are among the actions most commonly run on a day-to-day basis using particular tools and processes. They constitute the hands-on work of digital preservation, whether enacted manually or, more commonly, through scripted tools or preservation processing software. When evaluating a repository’s functional ability to preserve digital materials, identifying the presence (or absence) of these functions is paramount.
Checksums, Bit-Level Preservation, and Preservation Storage
Bit-level preservation is often considered the most basic set of actions an organization can take to support long-term preservation. This approach is focused on ensuring that files retain fixity (that is, they remain intact and unaltered in terms of the ordering of bits in the file) and that files are stored in multiple locations to protect against accidental loss, modification, or corruption. Bit-level preservation does not guarantee any form of future usability or accessibility based on the contents or format of the files in question. It simply provides the assurance that files are intact. The basic set of actions is as follows: When processing and storing data for preservation, the preserver runs a checksum algorithm against the files being uploaded and records the results. On a varying schedule, the preserver later runs the same checksum algorithm against the same files. This second check (and all subsequent ones) is called a fixity check. If the output of the second check matches the first, then the materials still have fixity. Ideally, the result of each fixity check, along with its date and time, is stored in a database or other location.
Checksums are unique numeric or alphanumeric strings of varying potential lengths produced by checksum-generating algorithms, like CRC, MD5, SHA1, and SHA256, based on the contents of a file. When the contents of the file are altered in any way, the checksum value will change, indicating that the file no longer has integrity and should be replaced with another copy. While CRC, MD5, and SHA1 are not considered secure for cryptographic purposes, they are still commonly used for detecting integrity issues. See Matthew Addis’s guide Which Checksum Algorithm Should I Use? for a good discussion of this topic. Indeed, checksums are a core component of many computing infrastructures. The key is to identify when and how they are run. Files are most likely to lose integrity during transport from one system to another, such as when uploading files to remote preservation storage over the web. Ideally, a checksum is generated locally on your computer first and compared with the result of a fixity check run on arrival at the destination. Keep in mind that various automated tools will do most of this work for you; files packaged in the BagIt format using tools such as the Python BagIt Library are a commonly used example.
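As a minimal sketch of this workflow, the Python example below uses the standard hashlib module to generate a SHA-256 checksum on ingest and to verify it later as a fixity check. The file path is hypothetical, and a real workflow would record each result, with its date and time, in a database or through a tool such as BagIt.

```python
# A minimal sketch of checksum generation and a later fixity check using
# Python's standard hashlib module. The file path is hypothetical; real
# workflows record each result (with date and time) in a database or log.
import hashlib

def sha256_checksum(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# On ingest: compute and record the checksum.
recorded = sha256_checksum("dataset_001/survey_responses.csv")

# Later, on a schedule: recompute and compare (a fixity check).
current = sha256_checksum("dataset_001/survey_responses.csv")
if current == recorded:
    print("Fixity check passed: the file is unchanged.")
else:
    print("Fixity check FAILED: replace this copy with an intact one.")
```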
The second important component of a bit-level preservation strategy is having multiple copies. If you identify an integrity issue, the ideal solution is to replace the “bad” copy with an intact version. Hence, having more than one copy, ideally in more than one location, enables you to quickly mitigate integrity issues that might arise. The types of preservation storage methods can vary widely based on the resources available to a preserving organization. For some, separate copies on external hard drives, the use of RAID drives, or local network storage (ideally backed up) may be all that is possible. Organizations doing preservation work on a larger scale may use tape storage systems. Third-party services, such as cloud storage or other replicated storage networks, are also available to meet the needs of memory institutions. The storage section of the National Digital Stewardship Alliance (NDSA) Levels of Digital Preservation Matrix is most useful for making decisions about how many copies to make and where to keep them; its recommendations range from keeping two copies in separate locations (but in the same geographic area) to “at least three copies in geographic locations, each with a different disaster threat” (National Digital Stewardship Alliance, 2019). Note that the NDSA Levels do not need to apply to all materials equally; many organizations apply different preservation storage strategies to different classes or genres of digital materials. See also Schaefer et al.’s (2018) “Digital Preservation Storage Criteria” framework for evaluating different preservation storage options.
File Format Identification
Identifying file formats is usually the first step a digital preservationist takes after ensuring the integrity and safe storage of the materials to be preserved. Knowing the format (and sometimes the specific version of that format) will help you decide how that file should be accessed and maintained over time. As a result, understanding file formats is a particular concern within the RDM community. Researchers are encouraged to export their final data files in nonproprietary formats, and institutions like Data Archiving and Networked Services (DANS) in the Netherlands have defined file format preferences for materials deposited in their repository.
Because file formats need to be identified reliably, there are established processes to help. You can usually figure out a file’s format from its extension; however, proprietary, obsolete, or specialized file formats may not be as identifiable, and systems often enable users to change extensions without changing the contents of the file. The key is to find a tool that identifies a file format by its signature. The signature is a series of bytes that occur in a predictable manner at the beginning and often the end of a file. For it to be a reliable marker, every instance of that file format should include this signature. Some file formats, such as plain text files, lack signatures, so inferences about the formatting of that text need to be made from the file’s content and structure. Tools that identify file format signatures commonly query the PRONOM database maintained by the UK National Archives, which includes an extensive listing of signatures associated with different file formats and versions. New formats are frequently added to PRONOM. MIME type identifications, which are commonly used by Internet browsers, email clients, and other software to identify file types, can use signatures but may also fall back on extensions. MIME types do not identify specific file format versions but can be useful when the more demanding threshold of signature-based identification fails. Signature-based file format identification tools include Siegfried (maintained by Richard Lehane) and FIDO (maintained by the Open Preservation Foundation).
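To illustrate the idea of signature-based identification, the toy sketch below checks a file’s opening bytes against a handful of well-known signatures. Production tools such as Siegfried and FIDO consult the much larger PRONOM registry, and the file path here is hypothetical.

```python
# A toy sketch of signature-based format identification: compare a file's
# leading bytes ("magic numbers") against a few well-known signatures.
# Real tools (Siegfried, FIDO) use the much larger PRONOM signature registry.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP package (also the container for DOCX, XLSX, ODF, etc.)",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
}

def identify(path):
    """Return a coarse format guess based on the file's opening bytes."""
    with open(path, "rb") as f:
        header = f.read(16)
    for signature, label in SIGNATURES.items():
        if header.startswith(signature):
            return label
    return "unknown (no matching signature; content analysis needed)"

print(identify("dataset_001/map_scan.tif"))  # hypothetical file
```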
File Format Validation
Once a file format has been identified, additional actions can flow from this information. File format validation is the process of checking whether a file meets the specifications that have been designed for its format. Not all file formats have published rules, but when they do, you can check whether any instance of a file is a “good” representation of that format. In the parlance of preservation, two questions are asked of a file: “Is it well formed?” and “Is it valid?” A well-formed file obeys the syntactic rules of its file format: it follows the basic structural rules as set out by its file format standard. For a file to be valid, it must first be well formed and must also meet higher-level, semantically defined rules for the minimum quality of that file format, such as a minimum amount of image data present in a TIFF file. As Trevor Owens (2018) notes, “many everyday software applications create files … that are to varying degrees invalid according to the specifications” (p. 120). In the context of research data, the applicability of validation will depend on the format at hand and the issues identified: does the format have a published specification, and is there a tool available to check the file against that specification? Perhaps more importantly, if a file is found to be not well formed, or well formed but not valid, what is the subsequent action? If a file is found to be fully corrupted, or the issues identified have a significant impact on the usability of the file, then it may be desirable to return to the creator and ask them to remediate the issue. In other cases, preservationists record validation information in metadata but do not act upon it. Paul Wheatley (2018) documents a useful set of questions to evaluate validation errors: Is the file encrypted? Is it dependent on external components you don’t have? Is it significantly damaged? Is the file in the format you think it is? Validation can help identify these issues at many stages. Some of these questions may be answered during the curation phase, when a data curator is actively checking files for their quality, completeness, and usability. A subsequent preservation workflow may then simply record the validation result in its metadata output so that the file can be checked again in the future. Tools for file format validation include JHOVE (maintained by the Open Preservation Foundation for a range of formats) and veraPDF for PDF/A files.
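As a simplified illustration of what “well formed” means in practice, the sketch below checks only the fixed header of a TIFF file: the byte-order marker followed by the magic number 42. A real validator such as JHOVE checks far more of the specification than this, and the file path is hypothetical.

```python
# A deliberately partial well-formedness check: a well-formed TIFF begins with
# a byte-order marker ("II" or "MM") followed by the magic number 42. Real
# validators such as JHOVE check the full format specification, not just this.
import struct

def tiff_header_is_well_formed(path):
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        return False              # too short to contain a complete TIFF header
    if header[:2] == b"II":
        endian = "<"              # little-endian
    elif header[:2] == b"MM":
        endian = ">"              # big-endian
    else:
        return False              # unrecognized byte-order marker
    (magic,) = struct.unpack(endian + "H", header[2:4])
    return magic == 42            # the TIFF magic number

print(tiff_header_is_well_formed("dataset_001/map_scan.tif"))  # hypothetical file
```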
File Format Conversion: Normalization and Migration
File format conversion is perhaps the most active process we’ll discuss in this section. Rather than simply gathering information about files, converting them into alternative formats actively transforms the content being managed. As noted above, this action can take place before files reach a repository, such as when researchers or other creators are encouraged to export their files in specific nonproprietary or otherwise preservation-friendly formats. Based on the results of file format identification, it may also take place while processing files to be placed in preservation storage. File format conversion also has the potential to impact significant properties or even the informational contents of a file and should be undertaken with an assurance that the resulting file in a new format still meets the needs of the Designated Community. Repeated testing and validation of conversion outputs with a variety of sample files is key.
Normalization and migration are different processes but end with the same result. Normalization is the process of converting files to a standard set of formats, as defined by the archive or repository, upon receipt or ingest. The idea is that the repository then has to manage only a subset of file formats into the future. Migration is when a repository converts files to a secondary format at some later date, usually at scale, in response to an identified risk, such as a format that is no longer supported. During both processes, a new copy of the file is created in a different format, which must also be managed by the repository. The original copy is usually retained to prevent accidental information loss as a result of the conversion. While preservation normalization was a default for many repositories in the past, more are now carefully evaluating whether and when normalization should occur, in order to minimize the environmental and financial impact of creating more copies than required.
Normalization and migration for preservation must be distinguished from these same actions for the purposes of access. Access normalization or migration is used to provide access copies to the Designated Community based on their needs. For example, a large TIFF file containing a map might be normalized into a JPEG for easier access online.
Tools for file format conversion are many and varied, depending on the specific format at hand. For example, common tools used in automated workflows include ImageMagick for images and FFmpeg for audio and video.
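As a rough sketch of how such tools are scripted in practice, the example below calls ImageMagick and FFmpeg from Python to produce access copies. It assumes the magick and ffmpeg commands are installed and on the PATH, and the filenames are hypothetical; a production workflow would also verify the outputs and record the conversions in preservation metadata.

```python
# A rough sketch of access normalization by calling external conversion tools.
# Assumes ImageMagick ("magick") and FFmpeg ("ffmpeg") are installed and on the
# PATH; filenames are hypothetical. check=True raises an error if a tool fails.
import subprocess

# Convert a large TIFF map into a JPEG access copy (ImageMagick).
subprocess.run(["magick", "map_scan.tif", "map_scan_access.jpg"], check=True)

# Convert a WAV recording into FLAC (FFmpeg); -y overwrites any existing output.
subprocess.run(["ffmpeg", "-y", "-i", "interview.wav", "interview.flac"], check=True)
```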
Evaluating Preservation Actions
At the File and Collection Level
Evaluating the results of preservation actions for individual files or collections at different levels of aggregation means running an action, such as file format identification or normalization, and inspecting the output. Typically, this is conducted on a test basis until the outputs are identified as acceptable, at which point more automated and scalable approaches take over for the final version. For file format identification and validation, the question is whether the result is as expected. For example, NVP files, which are produced by the NVivo software for qualitative data analysis, are not yet identifiable using a tool like Siegfried because there is no description of this format in PRONOM. The preservationist must decide if additional tools should be implemented to identify these files or if they are comfortable waiting for a future update to PRONOM, at which time they would rerun the identification process. If a file is not well formed but can be opened and viewed as expected, then the error flagged by the tool may not require deeper triage. It’s also important to evaluate the results of normalization and migration actions. Does a particular conversion tool produce a result that meets the needs of the Designated Community based on informational content as well as presentation? If not, additional tools and strategies, such as emulation, may be required. For example, converting MS Office documents, such as PowerPoint presentations, to PDFs requires access to the original fonts used unless they were embedded in the original file. Lacking access to these fonts, the layout and appearance of the PDF version may differ from the original. Is this important to the member of the Designated Community who is accessing the file, or is the informational content sufficient? Having access to members of the Designated Community via advisory groups, or querying members of the user community directly, can help with these evaluations.
At the Software and System Level
Based on the above examples, you can see how thinking about outputs at a granular level impacts decisions made system-wide. Implementing one tool to solve one set of problems then affects other relevant files in the repository. While preservation actions may be run individually, on a file-by-file basis, it’s more common for preservationists to rely on workflow tools designed to automatically run a series of linked actions at scale. A second job of the preservationist is to assess the functionality and impact of workflow software, including whether it can perform the required preservation actions in addition to validating the results. Some organizations may create custom, in-house scripts or tools for performing preservation actions, while others rely on open source or commercial software developed by third parties. In either case, for individual preservation actions, most preservation workflow tools (including commercial software) will use many of the open source tools mentioned above, such as Siegfried and JHOVE. One example of such software is Archivematica, an open source workflow application designed to produce preservation-worthy packages of data for long-term storage. Archivematica includes processes to create and validate checksums; perform file format identification, validation, and normalization for preservation and access; and connect with storage systems to deposit files for long-term storage. It packages preservation metadata using the METS and PREMIS XML standards. Defining the preservation priorities of the institution and understanding the collections it wishes to preserve can inform decisions about which preservation-supporting tools to implement and how to configure those tools. Making these determinations leads to defining preservation strategy and planning.
At the Strategy Level
Methods to link tools like Archivematica with systems and software for uploading research data have also been created. For example, an integration between the Dataverse software platform (research data repository software) and Archivematica enables preservationists to select and process research datasets independently of the repository software, meaning that they can store and manage research data deposited to a Dataverse collection as part of a larger preservation strategy at their institutions. For more information on the Dataverse software platform and Archivematica, see Meghan Goodchild and Grant Hurley’s paper, “Integrating Dataverse and Archivematica for Research Data Preservation.” Alternatively, hosts of Dataverse installations may offer preservation functionality of their own. For example, the Borealis application (an instance of a Dataverse installation hosted in Canada) includes a bit-level preservation strategy that involves regular integrity checking and replicated storage. Another job of the preservationist is to evaluate what kinds of actions are required across the collections stewarded by the institution. For example, an institution may be comfortable relying on a basic, bit-level preservation strategy for data that it is stewarding for a short period of time or that it does not consider core to its institutional collections. Others might define an appraisal or accessioning policy that identifies the requirements for datasets to be processed into preservation storage. Both approaches might be used in combination for different collections: lower-risk, lower-value materials might require only a bit-level strategy, whereas materials with higher value to the institution might require a more advanced approach using Archivematica. The same questions apply to the types of preservation storage selected, as discussed above in Checksums, Bit-Level Preservation, and Preservation Storage. Preservation planning at this level requires the definition of policies, plans, and other documentation. See Christine Madsen and Megan Hurst’s “Digital Preservation Policy and Strategy: Where Do I Start?” for a useful introduction to this topic.
Conclusion
Research data that are stored digitally are subject to a number of threats to their long-term accessibility. These threats include degradation of the files themselves and the loss of the knowledge necessary to access the digital objects or to understand them once accessed. Happily, a number of standards and practices have been developed to mitigate these risks. Such interventions can be both technical and policy-based, but all require two things. First is some degree of thoughtful planning, as it can be difficult or impossible to reverse engineer the knowledge necessary to understand a digital object once that knowledge has been lost. Second is an understanding of the Designated Community — the group for whom the data are being preserved. This knowledge allows preservationists to choose appropriate actions to ensure the data remain understandable, meaningful, and authentic for their intended users.
Reflective Questions
- What are some threats to the longevity of research data over time? Do these threats differ depending on the type of data being considered?
- Can you envision a scenario where an institution might choose to take some preservation actions but not others? For example, why might an institution engage in the generation and verification of checksums but not do any file format normalization?
- Think of an example dataset with which you are familiar. Then think of the users who might want to access this data. What questions are users likely to ask about the data, and why? Is it to help them know what piece of software they would need to open the files in the dataset, or is it about understanding where the data came from and how they were gathered?
Now think about the same users ten years in the future. Do you think a member of this future group would be asking the same questions, or might their concerns be different? If so, how?
Key Takeaways
- Common threats to data include the following: media obsolescence, media degradation, format obsolescence, and loss of provenance.
- Possible preservation actions include the following: checksums and bit-level preservation, technical metadata extraction, file format validation, and normalization and migration.
- When evaluating preservation actions, consider (1) what risks you are addressing and (2) the cost-effectiveness of the action.
- The effectiveness of preservation actions may vary depending on whether you are looking at the level of individual files and collections, a system or repository, or the organization as a whole.
Additional Readings and Resources
Addis, M. (2020). Which checksum algorithm should I use? Digital Preservation Coalition. http://doi.org/10.7207/twgn20-12
Borealis. (2022). Borealis preservation plan. https://borealisdata.ca/preservationplan/
Dorey, J., Hurley, G., & Knazook, B. (2022). Appraisal guidance for the preservation of research data. Appraisal for Preservation Working Group for the Digital Research Alliance of Canada. https://zenodo.org/record/5942236
Goodchild, M., & Hurley, G. (2019). Integrating Dataverse and Archivematica for research data preservation. In M. Ras, B. Sierman & A. Puggioni (Eds.), iPRES 2019: 16th international conference on digital preservation (pp. 234-244). https://osf.io/wqbvy
Lavoie, B. (2014). The Open Archival Information System (OAIS) reference model: Introductory guide (2nd Edition). Digital Preservation Coalition Technology Watch Report.
Madsen, C., & Hurst, M. (2019). Digital preservation policy and strategy: Where do I start? In J. Myntti & J. Zoom (Eds.), Digital preservation in libraries: Preparing for a sustainable future (pp. 37-47). ALA Editions Core, American Library Association.
Reference List
Bettivia, R. S. (2016). The power of imaginary users: Designated communities in the OAIS reference model. Proceedings of the Association for Information Science and Technology, 53(1), 1-9.
CCSDS. (2012). Reference model for an open archival information system (OAIS). (Recommended practice CCSDS 650.0-M-2). https://public.ccsds.org/pubs/650x0m2.pdf
DANS. (2022, June 20). File formats. https://dans.knaw.nl/en/file-formats/
Digital Preservation Coalition. (2015). Glossary. In Digital preservation handbook (2nd ed.). https://www.dpconline.org/handbook/glossary
Marks, S. (2015). Becoming a trusted digital repository. Trends in Archives Practice Module 8. Society of American Archivists.
McGovern, N. (2016). Digital preservation management model document. https://dpworkshop.org/workshops/management-tools/policy-framework/model-document
National Digital Stewardship Alliance (NDSA). (2019). 2019 LOP matrix. https://osf.io/36xfy
Owens, T. (2018). The theory and craft of digital preservation. Johns Hopkins University Press.
Schaefer, S., McGovern, N., Goethals, A., Zierau, E., & Truman, G. (2018). Digital preservation storage criteria, version 3. http://osf.io/sjc6u/
Wheatley, P. (2018, October 11). A valediction for validation? Digital Preservation Coalition Blog. https://www.dpconline.org/blog/a-valediction-for-validation
Glossary
Digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.
Digital materials: any piece of information, either singular or in assemblage, that is stored by computers. They are called digital because all computer-readable versions of data are ultimately encoded as a series of ones and zeroes, which are the only inputs computing systems can understand.
Research data: sources of information or evidence that have been compiled to serve as input to research.
OAIS: (ISO 14721) the Open Archival Information System. Published in 2005 and revised in 2012, OAIS defines a set of requirements for an information system meant to maintain the usability of digital objects over time.
Media obsolescence: a threat to the longevity of digital objects based on the notion that the media upon which they are stored may no longer be usable because a user would not have the correct hardware (or software like drivers) to access the data on the media. At the time of this writing, media obsolescence is commonly associated with floppy disks or various data cartridge formats that have fallen out of common use over time. Media obsolescence threats are often addressed by bit-level integrity methods, including the migration of digital objects to newer, more modern carriers on a regular basis.
Media degradation: a threat to the longevity of digital objects based on the decay of the carrier medium upon which they are stored. Sometimes called “bit rot.” Media degradation threats are often addressed by preservation actions that ensure bit-level integrity, including the active monitoring of digital objects to detect corruption/loss, and are often guarded against by maintaining multiple copies of an object on different pieces/types of media.
Format obsolescence: a threat to the longevity of digital objects based on an inability to decode the bitstream comprising the digital object. Format obsolescence threats are often addressed through a program of file format identification, validation, and, if necessary, normalization/migration.
Loss of provenance: a threat to the longevity of digital objects based on members of the user community being unable to discern important information about the digital object, such as its source, its history of changes, and ultimately its authenticity. Threats to the provenance of a digital object are often addressed through the careful creation and maintenance of preservation metadata.
File format: a standardized method of arranging ones and zeroes that can be used to encode specific types of information.
Provenance: a record of the source, history, and ownership of an artifact, though in this case the artifact is computational.
PREMIS: a metadata standard and data dictionary developed to standardize the way that preservation systems record and understand important concepts in the long-term preservation of a digital object. PREMIS files can include technical information (e.g., file format information, checksums) as well as provenance information (e.g., changelogs, acquisitions information).
Metadata: data about data; data that define and describe the characteristics of other data.
Designated Community: a conceptual entity introduced by OAIS, representing potential users of a digital object being preserved by an archive. The Designated Community is a crucial concept in long-term preservation planning because understanding the needs and capabilities of the Designated Community allows for informed decision-making regarding things like choices of file formats and retention of data.
Research Data Management (RDM): a term that describes all the activities that researchers perform to structure, organize, and maintain research data before, during, and after the research process.
Data object: for the purpose of the FAIR guiding principles, a data object is defined as an Identifiable Data Item with Data elements + Metadata + an Identifier.
Checksums: unique numeric or alphanumeric strings of varying potential lengths produced by checksum-generating algorithms, like CRC, MD5, SHA1, and SHA256, based on the contents of a file.
Bit-level preservation: a level of preservation that commits to the preservation of the ordered ones and zeroes that comprise a digital object, but which does not necessarily address the understandability of the encoded data.
Normalization: the process of converting copies of original files to one of a small number of non-proprietary, widely used, and preservation-friendly formats during ingest. Normalization standardizes ingested material into a subset of formats stored by an archives and allows the archives to avoid managing a large number of formats into the future. However, normalization can also alter file sizes and properties. Archives should assess normalization priorities and approaches through researching and defining file format policies (Scholars Portal, n.d.).
Fixity: a concept relating to the permanence of digital objects. Establishing consistency in digital objects can be tricky, as the way they are stored means that objects are often copied or transmitted frequently, raising questions as to whether the resulting object is the “same” as the object before copying/transfer. In common practice, fixity is closely tied to the generation and verification of checksums, which can help ensure that an ordered series of bits has remained unchanged.
Signature (file format): a series of bytes that occur in a predictable manner at the beginning and often the end of a file.
Qualitative data: data generated by research examining social aspects of the human condition using descriptive methods rather than measurement.
Open source: when software is open source, users are permitted to inspect, use, modify, improve, and redistribute the underlying code. Many programmers use the MIT License when publishing their code, which includes the requirement that all subsequent iterations of the software include the MIT License as well.