A Glimpse Into the Fascinating World of File Formats and Metadata

Émilie Fortin

doi:10.5206/QSVH4764

Working with Data

Learning Outcomes

By the end of this chapter you should be able to:

Understand what a sustainable file format is.
Properly choose a file format that meets your needs.
Understand the usefulness of metadata.
Identify the different types of metadata.

Introduction

The research data lifecycle always includes a preservation stage, sometimes referred to as archiving or retention. This stage is linked to data reuse because no one can reuse damaged or inaccessible data. The chapter, “Digital Preservation of Research Data,” addresses the issue of digital preservation; this chapter focuses on two elements that enable data retrieval and reuse: file formats and metadata.

File Formats

Pre-assessment

Answer the following questions as honestly as possible (Yes, No):

Are you having trouble opening files that you created more than ten years ago?
Do you think that ten years from now you will have difficulty opening files you created this year?
Do you think a PDF file is a perfect preservation format?
Do you wake up at night wondering if your great-grandchildren will still have digital photos of you?
Do you love interactive apps and want all your projects to be as connected as possible?

If you answered yes to more than two questions, this section should help you.

What is a File Format?

Digital file formats are designed according to predefined rules that outline their structure and organization. These principles are usually listed in a specification document that provides details on the subdivisions, encoding, and internal relationships that allow a format to be constructed and validated. A format specification indicates the boundaries between bit sequences. These bit sequences can represent, for example, a character, an operation to be performed (machine instruction), or a colour selection.

In summary, a file format is a specific, conventional arrangement of 1s and 0s that allows software to recognize and interpret the contents of a file.

From the moment you use digital media, no matter what you use them for, keep in mind that you are using, creating, or modifying formats.

What is a Sustainable Format?

No format is truly sustainable. Those that are deemed acceptable for long-term preservation are formats that remain accessible over time despite technological developments. A good format today can become obsolete in two, five, or ten years.

Here are some criteria for judging the sustainability of a format:

complexity
backwards compatibility
encoding
dependency
openness
metadata
property
usage
evolution
protections

Complexity. A format must provide good capabilities without being too complex, or its many features will make it difficult to maintain over time. The complexity of a format can be defined by its readability by humans, its level of compression, and the variety of its functionalities. The more effort needed to decipher a format, the less likely it is to be fully understood.

Backwards compatibility. Is a format known for its backwards compatibility? When a new software version is released, how feasible is it to open formats created with older versions of the software? Are the generations of the same format very different from each other?

Interesting fact: Did you know that Adobe provides backwards compatibility for PDF only as far back as version 1.3 (released in 1999)?

Encoding. In the Western world, formats will likely rely on ASCII or Unicode encoding. If you use other symbols or non-Latin characters, encoding is important because you want the letters and symbols to display properly no matter who opens your files.
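The effect of a mismatched encoding can be sketched in a few lines of Python. This is only an illustration, assuming a short sample string with one accented letter; the variable names are chosen for the example and are not part of any standard.

```python
# A name containing one non-ASCII character.
text = "Émilie"

# Store the text as bytes using UTF-8, a Unicode encoding.
data = text.encode("utf-8")

# Reading the bytes with the same encoding restores the text.
right = data.decode("utf-8")

# Reading them with a different encoding (here Latin-1) raises no error,
# but the accented letter is silently replaced by other characters.
wrong = data.decode("latin-1")

# Plain ASCII cannot represent the accented letter at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(right == text, wrong == text, ascii_ok)
```

Recording the encoding used, for example in a README file, helps avoid having to guess it later.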

Dependency. This is a question of the format’s dependence on its software, but also on a specific technology or hardware, on other files, or on its computer environment. Can the format be opened only by specific software? Is the format a container in which we find other formats (ZIP type compression format, video embedded in a text file, video file with a soundtrack, etc.)? Does the format need to connect to your environment to work (for example, an interactive book that is connected to your phone’s camera)?

Resources external to your file can be lost over time, so the more dependencies a format has, the harder it will be to preserve in its current form.

Openness. An open format is preferable.

Examples of open formats: Office files with an X (e.g., XLSX, DOCX), PDF, TXT, JPG, PNG, CSV.

Interesting fact: Some extensions hide the fact that a file is in an open format. For example, script and markup files may have extensions such as HTML, XML, or SC, but they are actually plain text.

Interesting fact: Some open formats have become standards over time. For example, PDF and PDF/A are ISO standards.

Metadata. This refers to the file’s internal metadata. Think about the file properties that you can access in a software application and through your operating system.

Identifying a format is a first step but documenting the content and the container as much as possible within the format is also very useful. The more a digital object is documented, the better it can be understood in the years to come. A file format that can embed metadata is advantageous, because if the file no longer opens, it is sometimes possible to retrieve valuable information thanks to its metadata (e.g., title, creator, software used to save the format). For more details on this, please see the “Metadata” section.

Property. A proprietary format belongs to a legal entity. It may or may not be open. Its evolution is controlled by its owner. These formats are generally attached to specific software. When the formats are non-proprietary, their evolution is controlled by a community of users and they are for the most part open.

Examples of non-proprietary formats: MKV, TXT, XML, CSV, PNG
Examples of proprietary and open formats: Office files with an X (e.g., DOCX, XLSX), PDF, RAR
Examples of proprietary formats: AutoCAD, PSD, WMA

Usage. If only ten people use a format, even if it is open and non-proprietary, it will disappear. On the other hand, an extremely popular proprietary format is very unlikely to die out in the next few years.

If a closed, proprietary format is adopted as a standard by a library, archive, or research community, there is a good chance that the format will live on thanks to its popularity. However, its development needs to be closely monitored.

Evolution. The format should follow a continuous improvement cycle but avoid excess. Systems change, and software and formats must evolve; a static format is not necessarily better than a format that is in development. However, releasing a series of new versions of a format within a limited time frame can be unwise, as frequent changes threaten long-term accessibility.

Protections. There are several technical file protection measures. For example, encryption and the use of a password are good methods for protecting sensitive data, but they are not compatible with long-term preservation. Just imagine the impact that losing a password can have!

Similarly, certain measures to protect the intellectual property of a file, such as locks on e-books, may compromise access to content.

Interesting fact: Some platforms allow for restricted access to files by applying permissions checks. This method is far preferable to locking the files themselves.

How to Choose a Format for a Research Project?

The criteria that define a sustainable format are important, but it is essential to choose them in ways that meet your project needs. It is not necessary to comply with all the criteria. Also, if your area of research requires you to use a format that does not meet any sustainable format criteria, you don’t need to refrain from using it; just be aware that there will be an impact on data preservation.

Here are some questions you can ask yourself to help you choose the best format:

Do you need to preserve your data long term? If you plan to delete all your data in five years and not share it, think only of your own immediate usage needs.
If you use research instruments/equipment, do you have a choice of format? If so, try to opt for a sustainable format if doing so would have no impact on your research.
Is the data’s appearance or layout important, or just the data itself? If the data layout is not important, you can opt for a simpler format. For example, a textual document stored as a PDF helps preserve the look and feel of a document, but content reuse is complex. However, if the text document is converted to TXT format, the formatting is lost, but the content can easily be reused.
Are the data independent or linked to other data? If your data are linked to equations or other files, you must preserve those links.
Do you need to control for file size? If you are limited on space, you may not have a choice but to opt for compression. Try using lossless compression.
In your discipline, is there a format that is used by most of your colleagues and that is considered essential?
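On the question of file size, "lossless" means that decompressing returns exactly the bytes that were compressed. A minimal Python sketch using the standard zlib module, with made-up sample data (the column names and values are hypothetical):

```python
import zlib

# Hypothetical, highly repetitive tabular data standing in for a real file.
original = ("sample_id,temperature_c\n" + "A1,21.5\n" * 1000).encode("utf-8")

# Compress at the highest compression level, then restore.
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original bytes.
print(len(original), len(compressed), restored == original)
```

A lossy format, by contrast, discards information during compression, so the original bytes cannot be rebuilt from the compressed file.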

In some cases, it is possible to keep data both in its original format and in a sustainable format, but this duplication must have a purpose. For example, your data may serve two very different communities that do not use the same level of technology. However, you should avoid the confusion that two versions of the same dataset could cause.

Another option could be to keep only the original format and to generate lighter copies of the files when necessary. This option is risky in the sense that it involves a dependency on software to read the original format.

You should also keep in mind that unreadable data in ten years will no longer be useful to anyone, including yourself.

Most national libraries publish a list of recommended formats. I’ve included a number of these lists in the Additional Resources section of this chapter; it may be useful to consult them. The lists include some of the formats that are generally accepted as sustainable in 2023.

Databases

A database involves values, but also a structure and relationships between values. The most commonly used databases at the time of writing are Microsoft Access, Oracle, MySQL, and PostgreSQL. When looking at long-term preservation of databases, one must assess future needs: is the database still in use? Will the preservation of values alone be sufficient? Must the structure of the database and relationships between data also be documented?

Databases are complex to preserve given their structure and the evolution of their content. It is important to define needs before choosing a preservation format.

Some recommended formats include:

Formats with value separators (CSV, TSV, TXT): preserve data, but not relationships or formulas. Especially useful for simple and small databases.
Database Preservation Format (SIARD 1.0 and 2.0): an open format established for preserving databases but only usable for certain types of databases.
Lightweight Relational Database Format (SQLITE): a simple format used for relational databases.
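The trade-off between these formats can be sketched in Python with the standard sqlite3 and csv modules. The two tables below are hypothetical and exist only for the illustration; the point is that a CSV export keeps the values, while the structure and relationships have to be saved separately.

```python
import csv
import io
import sqlite3

# A small in-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sighting (
        id INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species(id),
        seen_on TEXT NOT NULL
    );
    INSERT INTO species VALUES (1, 'Carpodacus mexicanus');
    INSERT INTO sighting VALUES (1, 1, '2021-10-10');
""")

def table_to_csv(conn, table):
    """Return the rows of one table as CSV text, header row first."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()

# The CSV keeps the values only.
sighting_csv = table_to_csv(conn, "sighting")

# The structure (types, keys, relationships) is kept separately, here as
# the SQL statements that SQLite stored when the tables were created.
schema_sql = "\n".join(
    row[0]
    for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
)

print(sighting_csv)
print(schema_sql)
```

Nothing in the CSV text says that species_id points to the species table; that information survives only in the schema statements.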

Tabular Data

The main challenge with these formats is dealing with formulas, macros, and embedded content. It should also be remembered that exporting a tabulated file to cloud computing software, or vice versa, can cause losses or errors.

Note that SPSS’s SAV format is sometimes recommended, although its documentation is unofficial and backwards compatibility is not guaranteed.

Some recommended formats include:

Data with Delimiters (CSV, TXT, TSV): simple files, but there is a loss of formulas and cell relationships.
Microsoft Excel (XLSX): documented and open format, but not recommended by some repositories, as it is a complex proprietary format. In some cases it remains unavoidable. If used, be sure to create a file with Office 2013 or later.
OpenDocument (ODS, FODS): usually associated with LibreOffice, a software suite developed as an open equivalent of Microsoft software. Structure based on XML. Version 1.2 is certified as an ISO standard; version 1.3 has achieved OASIS Standard status.

Text

A text document can be very simple, but it can also bring about some challenges. For example, using cloud-based word processing software makes collaboration much easier, but exporting these documents to save them locally can sometimes affect their formatting and hyperlink functionalities. Also, you should ask yourself which versions to keep; it is rarely useful to preserve every revision of and comment on a text. One solution is to preserve some intermediate versions along with the final version.

If the text document contains embedded objects, such as an image or a table, the selected format may vary. The choice of fonts can also affect the preservation of a textual document.

The text might also refer to other documents to help contextualize or better explain the content. These relationships are important and must be maintained.

The most appropriate format is the one that will retain the most functionalities from the original document while allowing for long-term access.

Some recommended formats include:

OpenDocument (ODT, OTT): usually associated with LibreOffice, a software suite developed as an open equivalent of Microsoft software. Structure based on XML. Version 1.2 is certified as an ISO standard; version 1.3 has achieved OASIS Standard status.
Plain text (TXT): no page layout, but easily accessible and does not depend on any program, which is why it is highly recommended for README files.
PDF and PDF/A: common format, often used for long-term preservation. Ideally, make sure to only keep versions 1.3 and later.
Electronic Publication (EPUB): an open format, widely used for digital publishing.

Interesting fact: Commercial EPUB files may contain built-in protections to protect intellectual property by preventing copying and sharing. These digital locks are incompatible with long-term preservation.

Images

Most digital preservation experts agree on the most secure image formats to use. The formats mentioned below are raster files; that is, they consist of a series of dots called pixels.

The quality of a format can vary according to several factors such as resolution (the best known), but also colour space or colour depth. Often, the higher the quality of an image, the larger the file.

The RAW proprietary format is not recommended for long-term preservation. Conversely, an image created with a compressed format (e.g., GIF, JPG, BMP) could be preserved as is. Ultimately, technological, human and financial needs and resources need to be assessed before choosing an image format.

Some recommended formats include:

Tagged Image File Format (TIFF): most used format for preserving images, but heavy.
Joint Photographic Experts Group 2000 (JP2): lighter than TIFF, but less widely used.
Joint Photographic Experts Group (JPG): widely used, but the image is compressed.
Portable Network Graphics (PNG): uses lossless compression. Fairly commonly used, but not always supported by software.

Audio

An audio format is a container with one or more audio data streams.

Several characteristics need to be considered that will influence the rendering and authenticity of the sound: channels, compression, number of bits per sample, number of samples per second, etc. If the original file is already compressed (e.g., MP3, AAC), it may not make sense to migrate it to another format.

Note that MP3 is a compressed format not generally recommended for long-term preservation, but its widespread adoption makes it a fairly reliable format if the original file was created that way.

Some recommended formats include:

Free Lossless Audio Codec (FLAC): file with lossless compression, lighter format than WAVE.
PCM WAVE (WAV): quality format used by several national libraries during digitization.
Broadcast WAVE (BWF): allows the addition of metadata in the files.
Ogg Vorbis (OGG): open format with better compression than MP3, but less popular.

Video

Video formats are complex and ever-changing, and there is no consensus on any one format in the digital preservation community.

Video formats are generally containers with images or streams of video and sound data. Several characteristics (e.g., colour, compression, sound) can influence their long-term preservation. More than one format can be used for a project depending on the different project goals or outputs, which could range from video creation, to editing, to distribution.

The biggest challenge is balancing file weight and file quality.

Some recommended formats include:

MP4 with H.264: compressed format mainly used for broadcasting; very widespread.
QuickTime (MOV) or uncompressed Audio Video Interleave (AVI) 4:2:2: very heavy formats, but good quality.
Matroska with FFV1 codec (MKV): standardized format not overly compressed.
Material Exchange Format with JPEG 2000 (MXF): recommended by some national libraries, well documented, but little used by the public.
Digital Picture Exchange (DPX): very heavy format used when digitizing film stock.

Geospatial Data

Geospatial data are also covered in the chapter, “Geospatial Research Data in Canada: An Overview of Various Projects.” These data usually consist of a series of files that complement each other. They can be intrinsically linked to the geographic information system that uses them. Metadata, coordinate referencing systems, and coordinate precision (i.e., how close an observed and recorded value is to the actual value) must be preserved with the data.

Listing recommended formats for the long-term preservation of geospatial data is almost impossible given their complexity (i.e., several different types of structures, many proprietary formats). There is no consensus on this and keeping the original format may be the best solution.

Some recommended formats include:

Geospatial Tagged Image File Format (GEOTIFF): an open format that allows geographic coordinates to be added to an image.
Geographic Markup Language (GML): an open format based on a standard, but it is complex.
Keyhole Markup Language (KML, KMZ): XML language that can be associated with several other files that must also be archived (avoid using hyperlinks). Open and widely used format.
ESRI Shapefile (SHP, SHX, DBF, PRJ, SBX, SBN): proprietary, but open and widely used format.

Digging Deeper: How to Identify a Format?

To identify a file format, it is usually sufficient to look at the final section of the file name, which is its extension. For example, the file “my-notes.xlsx” is an Excel file while “my-photo.jpg” is an image. This method has limitations since an extension can be modified, intentionally or not, or it may be completely unknown. Some operating systems are even configured by default to hide the extension of files, which can complicate the task.

The best way to identify a format is by using its signature. A file signature is a series of bits that are strung together in a predictable fashion at the beginning, end, or at both ends of a file.
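As an illustration, signature-based identification can be sketched in Python by comparing the first bytes of a file with a few widely documented signatures. This is a simplified sketch, not how the tools listed below are implemented; they rely on much larger signature registries. The function names and labels are chosen for the example.

```python
# A few well-known signatures found at the beginning of a file.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also DOCX, XLSX, ODT, EPUB)"),
]

def identify(header: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"

def identify_file(path: str) -> str:
    """Read the first bytes of a file and identify them, ignoring the extension."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))

print(identify(b"%PDF-1.7\n"))
```

Because only the content is examined, a PDF renamed to "my-notes.xlsx" would still be reported as a PDF document.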

A tool like PRONOM, widely used in the digital preservation community, works by recording the start and end signatures of file formats (known as Beginning of File (BOF) and End of File (EOF) signatures). This allows a user to retrieve the unique identifier of a format. As an example, the identifier x-fmt/398 identifies JPG version 2.0. Knowing a format will be helpful to those who want to view datasets and better understand how to open them.

Some file format identification tools include:

PRONOM: http://www.nationalarchives.gov.uk/pronom/
Siegfried: https://www.itforarchivists.com/siegfried
FIDO: https://github.com/openpreserve/fido or https://fido-js.glitch.me/

Tools that allow viewing files in hexadecimal code:

HexEd.it: https://hexed.it/
Literate-binary: https://github.com/marhop/literate-binary

Metadata

Pre-assessment

Answer the following questions as honestly as possible (Yes, No):

Do you understand what “data about data” means?
Do you know that there is more than one type of metadata?
Do you know that some metadata are automatically written into your files?
Do you know that your brother-in-law could appear as the author of a file you created when using his computer?
Do you realize the power of metadata?

If you answered no to more than two questions, this section should help you.

An Introduction to Metadata

Metadata are pieces of information used to describe the content or container of a resource. They can be structured or not.

To understand what metadata are, let’s start with an example of raw data:

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAACTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATTTACAATGTAATCGTCACTGCCCACGCCTTCGTAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGTTTCGGAAACTGACTAGTCCCACTCATAATCGGCGCCCCCGACATAGCATTCCCCCGTATAAACAACATAAGCTTCTGACTACTTCCCCCATCATTTCTTTTACTTCTAGCATCCTCCACAGTAGAAGCTGGAGCAGGAACAGGGTGAACAGTATATCCCCCTCTCGCTGGTAACCTAGCCCATGCCGGTGCTTCAGTAGACCTAGCCATCTTCTCCCTCCACTTAGCAGGTGTTTCCTCTATCCTAGGTGCTATTAACTTTATTACAACCGCCATCAACATAAAACCCCCAACCCTCTCCCAATACCAAACCCCCCTATTCGTATGATCAGTCCTTATTACCGCCGTCCTTCTCCTACTCTCTCTCCCAGTCCTCGCTGCTGGCATTACTATACTACTAACAGACCGAAACCTAAACACTACGTTCTTTGACCCAGCTGGAGGAGGAGACCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

Raw data from research, devoid of metadata, are interesting, but not meaningful to most people. It is easy to see that there is a large gap between the raw data extracted during a research project and their meaning and, thus, usability for humans.

If a geneticist wants to describe the raw data above, she could add the following description, which would be the first level of metadata:

>Seq1 [organism=Carpodacus mexicanus] C. mexicanus clone 6b actin (act) mRNA, partial cds

A second level of metadata would be the description of the dataset that this sequence is a part of: genetic sequencing, in this case, of Carpodacus mexicanus, a species of bird.

This is a nucleotide sequence of Carpodacus mexicanus (clone 6b). (A = Adenine, G = Guanine, C = Cytosine, T = Thymine: nucleic acid bases).
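The first-level description line shown earlier (the one beginning with ">Seq1") appears to follow the FASTA convention, in which a line starting with ">" carries an identifier, optional bracketed modifiers, and a free-text title. That reading is an assumption, since the chapter does not name the format. Under that assumption, a minimal Python sketch can turn the line into structured metadata; the function and key names are hypothetical.

```python
import re

defline = (
    ">Seq1 [organism=Carpodacus mexicanus] "
    "C. mexicanus clone 6b actin (act) mRNA, partial cds"
)

def parse_defline(line: str) -> dict:
    """Split a description line into identifier, modifiers, and title."""
    if not line.startswith(">"):
        raise ValueError("a description line is expected to start with '>'")
    # The identifier is everything between '>' and the first space.
    seq_id, _, rest = line[1:].partition(" ")
    # Modifiers are written as [name=value].
    modifiers = dict(re.findall(r"\[(\w+)=([^\]]+)\]", rest))
    # Whatever remains after removing the bracketed parts is the title.
    title = re.sub(r"\[[^\]]*\]", "", rest).strip()
    return {"id": seq_id, "modifiers": modifiers, "title": title}

record = parse_defline(defline)
print(record)
```

Once the description is structured in this way, a machine can search by organism without having to interpret the free text.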

A third level of metadata would make it possible to better characterize the previous metadata by standardizing the nomenclature used, which will facilitate search and retrieval in other resources, such as article databases or institutional repositories:

House finch – Genetics
Nucleotide sequence

A fourth level would link this metadata to other relevant information, such as an image.

[Image: Carpodacus mexicanus QC by Simon Pierre Barrette, licensed CC BY-SA 3.0. Source: https://commons.wikimedia.org/wiki/File:Carpodacus_mexicanus_QC.jpg]

The main goal of metadata is to describe and enable retrieval. Any metadata present should facilitate the tasks performed when using general or academic search engines, which are:

Finding: finding resources that match the search criteria.
Identifying: establishing the context of the data and confirming that the resource described corresponds to the resource sought, or distinguishing between two or more resources with similar characteristics.
Selecting: selecting a resource that is relevant to the needs of the searcher.

Metadata necessary for preservation are those that ensure the authenticity and long-term accessibility of digital resources and that allow recovered files to be accessible, readable, and intelligible. Metadata need to be managed and discovered independent of the systems they were created with.

Metadata Normalization

Some metadata can be standardized, such as the names of those responsible for a research project, the methods of data collection and analysis, variable titles, subjects covered by the research, as well as temporal or geographical coverage. Other types of metadata will adopt less precise description rules. They aim to standardize the display of the resource being described. This includes, for example, the title attributed to a research project or an abstract describing a dataset.

The more metadata are standardized, the more they contribute to the FAIR principles (detailed further in chapter 2, “The FAIR Principles and Research Data Management”) and the more they allow for the Findability, Accessibility, Interoperability, and Reuse of the resources they represent. When describing a resource, whether it is data or a dataset, it is necessary to select the most useful metadata to make the best use of time and effort.

Several methods can be used to standardize metadata. However, there is often terminological confusion, as certain terms are used incorrectly to describe different concepts.

Metadata Schemas

To fully understand metadata schemas, imagine an online form with empty fields to fill in. The schema hides behind the form and gives meaning to the information added to each field.

Some schemas specify the syntax with which elements should be encoded, while others, such as Dublin Core and Data Documentation Initiative (DDI), only provide fields for storing information without giving any indication as to how the content should be entered or its syntax.

Let’s take the house finch as an example. A birdwatcher wants to enter a sighting of the bird in a repository that uses the Darwin Core metadata schema. He will need to fill in the following fields:

Fields to fill	Darwin Core elements behind the scenes
Time of sighting	eventDate
Observer	identifiedBy
Scientific name	scientificName
Kingdom	kingdom
Class	class
Order	order
Family	family
Genus	genus
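The idea of a schema hiding behind a form can be sketched in Python as a simple mapping from form labels to the Darwin Core elements in the table above. The function name and the sample values are hypothetical and serve only as an illustration.

```python
# Form labels mapped to the Darwin Core elements from the table above.
FORM_TO_DARWIN_CORE = {
    "Time of sighting": "eventDate",
    "Observer": "identifiedBy",
    "Scientific name": "scientificName",
    "Kingdom": "kingdom",
    "Class": "class",
    "Order": "order",
    "Family": "family",
    "Genus": "genus",
}

def to_darwin_core(form: dict) -> dict:
    """Rename the keys of a filled-in form to their schema elements."""
    unknown = set(form) - set(FORM_TO_DARWIN_CORE)
    if unknown:
        raise KeyError(f"no schema element for: {sorted(unknown)}")
    return {FORM_TO_DARWIN_CORE[label]: value for label, value in form.items()}

# A partially filled-in form with hypothetical values.
form = {
    "Time of sighting": "2021-10-10",
    "Observer": "A. Birdwatcher",
    "Scientific name": "Carpodacus mexicanus",
    "Genus": "Carpodacus",
}
sighting = to_darwin_core(form)
print(sighting)
```

The birdwatcher only ever sees the labels on the left; it is the element names on the right that other systems can read and exchange.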

There are many schemas; some are general, while others are discipline-specific. A widely used, standardized schema can be machine-readable, which increases the visibility of data and the possibility of their reuse. However, these advantages are lost when creating an in-house metadata schema.

In summary, a metadata schema serves as a structure and a container for information about datasets and, to some extent, adds to their meaning.

Description Rules

Description rules make it possible to standardize, normalize, and structure information relating to datasets. These rules will prescribe the transcription of information, the use of capital letters, as well as element syntax or order. Rules are schema independent and can be used in any data repository.

To illustrate, let’s use the example of the finch-enthusiast birdwatcher. He wants to know if this species has been sighted in his area on a specific date. He looks up three repositories that use the Darwin Core schema. Searching with the date October 10, 2021, he finds results in only one of the repositories. Why? Because the repositories use different description rules for dates. The first has no requirements, and it is the repository where the October 10, 2021 entry is found; the second requires the ISO 8601 standard (YYYY-MM-DDTHH:MM:SSZ), so the date is recorded as 2021-10-10; and the third requires the form DDMMYYYY, so the desired entry is recorded as 10102021.
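A minimal Python sketch shows how the three date forms from this example could be brought back to a single rule, here the ISO 8601 calendar date. The list of accepted forms is an assumption made for the illustration, and month names are assumed to be in English.

```python
from datetime import datetime

# The three forms from the example: ISO 8601 date, free text, DDMMYYYY.
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d%m%Y"]

def to_iso_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known form in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date form: {raw!r}")

entries = ["October 10, 2021", "2021-10-10", "10102021"]
normalized = [to_iso_date(entry) for entry in entries]
print(normalized)
```

A form such as DDMMYYYY cannot be told apart from MMDDYYYY by a machine, which is one reason an unambiguous description rule is preferable from the start.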

Clear description rules are also very useful for personal names, especially in the case of common names. It’s important to avoid the use of initials, homonyms, or pseudonyms. Depositing data gives visibility to researchers, but to do this, it is necessary to be able to identify, without ambiguity, the person responsible for the data.

A name is sometimes not enough to distinguish between people, which is why it is recommended to also use Persistent Unique Identifiers (PIDs), such as ORCiD.

Controlled Vocabularies

Controlled vocabularies standardize indexing and make it easier to find and locate information. A controlled vocabulary is a set of terms recognized, standardized, and validated by a group or community of practice and used to index or analyze the content of a resource.

If several terms refer to the same concept, only one of them will be chosen and identified as the “preferred term”; all others, considered as possible synonyms, will be mentioned as “rejected terms.”

Let’s go back to the birdwatcher who, this time, is looking for information on the finch in an English-language data repository. The data in this repository are indexed with free vocabulary, but also with FAST (Faceted Application of Subject Terminology). To retrieve all the information on the species, the birdwatcher searches for the term “finch” and discovers that “house finch” is the term chosen by FAST. With this term, he can successfully search the repository and retrieve all available data.

Thesauri and subject heading directories are the most common and well-known examples of controlled vocabularies. There are encyclopaedic vocabularies, but also specialized vocabularies specific to certain disciplines, e.g., ERIC, a thesaurus that specializes in education, or WoRMS, a catalogue of the names of marine organisms.

Several of these vocabularies are multilingual or provide linguistic equivalents, which is a valuable contribution to interoperability.

Digging deeper: Ontologies

An ontology is a theoretical representation of a domain of knowledge with concepts linked by semantic and logical relations. It includes vocabularies and definitions, and specifies how concepts are interrelated. An ontology makes it possible to establish a set of relations and to describe specific situations in a given domain. It also imposes a structure on the domain and limits the possible interpretations of terms. Put simply, an ontology makes it possible to offer a common language to blocks of information linked to each other. It is to metadata what grammar is to language.

One of the main advantages of using an ontology is the interoperability, reuse, and sharing of metadata. The main difference between an ontology and a controlled vocabulary is that the controlled vocabulary proposes semantic relations between the elements that compose it, while the ontology will propose functional relations making it possible to describe situations precisely.

For example, in a controlled vocabulary, “house finch” is the preferred term. It is related to “Carpodacus,” which is the general term, as well as “Mexican finch” and “Carpodacus mexicanus,” which are two rejected terms. In an ontology, “house finch” could be linked through the relationship “habitat” to the terms “suburb” and “semi-desert.” The ontology could also point to the “feeding” relationship to make a link between the finch and other “granivores” and “insectivores.”
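The contrast can be sketched in Python using only the terms from this example. The data structures below are a simplification chosen for the illustration, not the way any particular vocabulary or ontology is stored.

```python
# Controlled vocabulary: every known term points to one preferred term.
PREFERRED = {
    "house finch": "house finch",
    "mexican finch": "house finch",          # rejected term
    "carpodacus mexicanus": "house finch",   # rejected term
}
# The general (broader) term for a preferred term.
BROADER = {"house finch": "Carpodacus"}

def preferred_term(term: str):
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return PREFERRED.get(term.strip().lower())

# Ontology: named relationships between concepts, stored as triples
# of (subject, relationship, object).
TRIPLES = [
    ("house finch", "habitat", "suburb"),
    ("house finch", "habitat", "semi-desert"),
    ("house finch", "feeding", "granivores"),
    ("house finch", "feeding", "insectivores"),
]

def related(subject: str, relation: str) -> list:
    """Return every object linked to the subject by the given relationship."""
    return [obj for subj, rel, obj in TRIPLES if subj == subject and rel == relation]

print(preferred_term("Mexican finch"), related("house finch", "habitat"))
```

The vocabulary answers the question "which term should I use?", while the triples answer questions about the bird itself, such as where it lives.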

Types of Metadata

There are various ways of categorizing metadata. In this chapter, the following groupings will be used: descriptive, structural, technical, access, and preservation metadata. The last three types of metadata in the list are less straightforward to understand. They are introduced below for those interested in gaining more advanced knowledge on the topic.

Beyond these categories, metadata can also be classified by their source (internal, external), their mode of creation (manual, automatic), their status (static, dynamic), their structure (structured or not), and other characteristics. For more information on this, please consult the resources at the end of this chapter.

Descriptive Metadata

As their name suggests, descriptive metadata are used to describe a resource’s content and ensure that it can be found, whether by humans or by machines. The title of a work, the name of its creator, and the date of creation are examples of descriptive metadata found in data repositories, library catalogues, or databases.

In the case of research data, descriptive metadata generally refer to fields to be filled in data repositories. In addition, where the data are not deposited in a repository, a text file, such as a README file, can be used to support descriptive metadata.

Project metadata describe the “who, what, where, when, and why” of a dataset, which provides context for understanding the purpose of data collection, methodology, and use.

Dataset metadata are more granular. They describe and contextualize the data in more detail, including, for example, variables, units of measurement, and observations. This information may also be present with the data themselves.

The rules to follow for descriptive metadata matter. The better a dataset is described, the easier it will be to find and the easier it will be to attribute credit to the right people. In this sense, the use of unique identifiers such as DOIs and ORCID iDs, as well as controlled vocabularies such as FAST and its French-language equivalent, RVMFAST, makes it possible to disambiguate people and digital objects. Metadata standardization also supports interoperability between systems.

The best way to harness the power of descriptive metadata is to:

use unique identifiers where possible.
use existing metadata schemas well established in your research community.
standardize metadata where possible (names, subjects, geospatial coordinates, dates, etc.), ideally with controlled vocabularies.
follow the advice suggested by repositories for completing their metadata fields, i.e., mandatory fields, recommended fields, and optional fields.
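As an illustration of these recommendations, the sketch below checks a descriptive record before deposit. The field names and the list of required fields are assumptions chosen for the example, not the schema of any particular repository. The ORCID check uses the check-digit calculation (ISO 7064 MOD 11-2) that ORCID documents for its identifiers, and dates are expected in the ISO 8601 form YYYY-MM-DD.

```python
import re
from datetime import date

ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

# Hypothetical list of mandatory fields; real repositories define their own.
REQUIRED_FIELDS = ("title", "creator", "orcid", "date_created")


def orcid_checksum_ok(orcid: str) -> bool:
    """Verify the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[15] == expected


def check_record(record: dict) -> list:
    """Return a list of problems found in a descriptive metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")
    if record.get("orcid") and not orcid_checksum_ok(record["orcid"]):
        problems.append("orcid: invalid identifier")
    if record.get("date_created"):
        try:
            date.fromisoformat(record["date_created"])
        except ValueError:
            problems.append("date_created: expected YYYY-MM-DD")
    return problems


# Illustrative record; the date is a made-up value.
record = {
    "title": "Bird survey observations",
    "creator": "Fortin, Émilie",
    "orcid": "0000-0002-9717-6840",
    "date_created": "01/05/2023",
}
print(check_record(record))
```

A check like this only catches formatting problems. It cannot tell whether the identifier belongs to the right person, which is why following the repository's own guidance remains important.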

Each discipline uses its own metadata, schemas, ontologies, and controlled vocabularies. For some examples of these particularities, see the chapters “Managing Quantitative Social Science Data” and “Managing Qualitative Research Data.”

Interesting fact: Many files have descriptive metadata embedded in their format. Have you ever looked at the file properties attributed by a software application or your operating system? You might be surprised! Sometimes a software application automatically fills in the “author” information with the name of the owner of the software or inserts geographic coordinates into the file of a photo taken with a cellphone!
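To see what such embedded metadata can look like, here is a minimal sketch for DOCX files. A DOCX file is a ZIP package that normally stores its core properties in a part named docProps/core.xml. So that the example runs on its own, it builds a tiny in-memory package containing only that part (not a complete DOCX), with made-up values; the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core properties part of Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample values, for illustration only.
SAMPLE_CORE_XML = (
    '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}">'
    "<dc:title>Field notes</dc:title>"
    "<dc:creator>Name of the software owner</dc:creator>"
    "<cp:lastModifiedBy>Research assistant</cp:lastModifiedBy>"
    "</cp:coreProperties>"
).format(**NS)


def build_sample_package() -> io.BytesIO:
    """Create an in-memory ZIP holding only docProps/core.xml (not a full DOCX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", SAMPLE_CORE_XML)
    buffer.seek(0)
    return buffer


def read_core_properties(package) -> dict:
    """Return the title, author, and last editor embedded in the package, if present."""
    with zipfile.ZipFile(package) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    wanted = {
        "title": "dc:title",
        "author": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
    }
    found = {}
    for label, path in wanted.items():
        element = root.find(path, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found


print(read_core_properties(build_sample_package()))
```

Passing the path of a real DOCX file to `read_core_properties` should work in the same way, assuming the file contains the core properties part.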

Structural Metadata

Structural metadata help establish links between and within files. It is as much about the physical structure of a file (the links between different pieces of content) as it is about the logical structure of a document (the links between files). For example, you might have an article in PDF format and the associated graphics in a separate DOCX file. You might also have information about where text and images are located on a page, and information about page order.

Some of these metadata are generated automatically; others must be entered manually. They can be useful if you have to switch from a complex format to a simple format and doing so would require breaking down your data. You may need to describe the links between your files to represent the original format. This information can be noted in a text file or by using code.

If your files are not independent or they refer to other files, think about the structural metadata. They will allow you to fully understand your data.
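One possible way to note these links "by using code" is a small manifest kept next to the files. The sketch below is only an illustration: the file names, keys, and relation labels are hypothetical and do not follow any particular standard. One relation deliberately points to a file that is not listed, to show how a broken link can be detected.

```python
import json

# Hypothetical file names, keys, and relation labels; not a standard schema.
manifest = {
    "files": ["article.pdf", "figures.docx"],
    "relations": [
        {"source": "article.pdf", "relation": "has_part", "target": "figures.docx"},
        {"source": "article.pdf", "relation": "has_part", "target": "tables.csv"},
    ],
    "page_order": {"article.pdf": [1, 2, 3]},
}


def unresolved_files(manifest: dict) -> list:
    """Return files named in a relation but missing from the list of files."""
    known = set(manifest["files"])
    missing = set()
    for relation in manifest["relations"]:
        for end in ("source", "target"):
            if relation[end] not in known:
                missing.add(relation[end])
    return sorted(missing)


print(unresolved_files(manifest))

# The manifest can be saved as plain text alongside the data.
print(json.dumps(manifest, indent=2))
```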

Digging Deeper: Other Types of Metadata

Descriptive and structural metadata are fairly easy to understand, even though their exact definitions may be debatable. However, definitions for technical, access, and preservation metadata are more ambiguous. Sometimes these metadata are grouped together under the term “administrative metadata.” The divisions below are used for explanatory purposes only.

Most of the metadata below are created automatically within files and it is not essential to know them. It is possible to modify some of these internal metadata and, indeed, some software applications can extract them so that they can be kept separately. However, good knowledge of formats and metadata is recommended before attempting to do this.

As mentioned previously, a format change can be positive for the long-term preservation of files. Such a conversion may impact the file’s internal metadata. Extracting these metadata from the original format and keeping them alongside the digital object allows for the provenance and authenticity of the files to be documented.

Technical Metadata

Technical metadata are highly format-specific and almost always embedded within files. They document the creation of the file (software used, version, operating system, date of creation and last modification, etc.) and the characteristics of the digital objects, which vary according to the type of format.

Examples of technical metadata include:

For text: encoding, structure in XML …
For images: resolution, colour profile, encoding depth …
For sound: bitrate, codec, sample rate …
For video: number of frames per second, colour profile, duration …
For web content: format declared in the header, server response collected …

Extracting technical metadata helps prove that a format is what it claims to be. It provides information about an unknown or corrupted digital object.
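A very simple form of this check is to compare a file's extension with the first bytes of its content, often called its signature or "magic number." The sketch below covers only a handful of widely documented signatures, and the function names are hypothetical. Dedicated format identification tools rely on much larger signature registries and deeper inspection.

```python
from pathlib import PurePath

# A few widely documented file signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP"),  # also the container for DOCX, XLSX, ODT, etc.
]

# Format that each extension claims to be.
EXTENSION_TO_FORMAT = {
    ".pdf": "PDF",
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".gif": "GIF",
    ".zip": "ZIP",
    ".docx": "ZIP",
    ".xlsx": "ZIP",
}


def identify(header: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def extension_matches(filename: str, header: bytes) -> bool:
    """Check whether the extension agrees with the detected signature."""
    expected = EXTENSION_TO_FORMAT.get(PurePath(filename).suffix.lower())
    return expected is not None and identify(header) == expected


# A file named like a PDF but whose content starts like a PNG.
print(identify(b"\x89PNG\r\n\x1a\n\x00\x00"))
print(extension_matches("report.pdf", b"\x89PNG\r\n\x1a\n\x00\x00"))
```

With a real file, the header can be obtained by opening the file in binary mode and reading its first few bytes.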

Access and Use Metadata

Access and use metadata include information that allows the research community to download data and reuse it legally.

To avoid any rights violations, these metadata provide information on provenance, on the conditions of access (open access, embargo, confidentiality form, etc.), and on the conditions of use (free, with citation, read-only, etc.). They may also include digital signatures. These metadata make it possible for repository administrators to carry out preservation actions in a legal manner.

Preservation Metadata

Preservation metadata are usually tied to specific metadata schemas like METS or PREMIS and represent the actions performed on files to preserve them.

They include everything related to the integrity and authenticity of a digital object (see the chapter, “Digital Preservation of Research Data,” for more on this topic). Minimally, a checksum should be calculated. With preservation metadata, you can trace all changes made to a file such as format changes, checksum checks, and physical media moves, as well as those who made the changes.
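As a minimal sketch of what calculating and checking a checksum involves, the example below computes a SHA-256 checksum and records a "fixity check" event. The event keys are hypothetical and much simpler than what a schema such as PREMIS defines; the file and the agent name are created only for the demonstration.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_event(path: Path, expected: str, agent: str) -> dict:
    """Compare the current checksum with the recorded one and describe the check."""
    actual = sha256_of(path)
    return {
        "event_type": "fixity check",
        "file": path.name,
        "date_time": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "outcome": "pass" if actual == expected else "fail",
    }


# Demonstration with a temporary file (hypothetical content and agent).
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "observations.csv"
    data_file.write_bytes(b"species,count\nhouse finch,12\n")
    recorded = sha256_of(data_file)  # checksum stored at deposit time

    print(fixity_event(data_file, recorded, "data steward")["outcome"])

    # Any change to the file, even one byte, changes the checksum.
    data_file.write_bytes(b"species,count\nhouse finch,13\n")
    print(fixity_event(data_file, recorded, "data steward")["outcome"])
```

Keeping each event, with its date and agent, is what makes it possible to trace the history of a file over time.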

Conclusion

The title of this chapter refers to a fascinating world for good reasons. We have only offered a preliminary survey of the world of file formats and metadata. Be assured, however, that it is not essential to master all the secrets of file formats, controlled vocabularies, or metadata schemas to ensure accessible and reusable data in the long term.

Key Takeaways

The choice of a format depends on several factors, but mainly on the needs and capacities of those who use it.
The best research data cannot be found and understood, including by those who created them, without quality metadata. Quality is preferred over quantity.
View formats and metadata as allies and not obstacles; you may find they are at times difficult, but always reliable, friends!

Additional Readings and Resources

Corti, L., Van den Eynden, V., Bishop, L., Woollard, M., Haaker, M., & Summers, S. (2019). Managing and sharing research data: a guide to good practice (2nd ed., vol. 1). Sage.

Formats

Canadian Resources

Bibliothèque et Archives nationales du Québec. (2020, March). Guide concernant les formats recommandés par BAnQ. https://numerique.banq.qc.ca/patrimoine/details/52327/4076856

Bieman, E., & Vinh-Doyle, W. (2019). National heritage digitization strategy – Digital preservation file format recommendations. Government of Canada, Canadian Heritage Information Network. https://www.canada.ca/en/heritage-information-network/services/digital-preservation/recommendations-file-format.html

Library and Archives Canada. (2022). Guidelines on file formats for transferring information resources of enduring value. https://library-archives.canada.ca/eng/services/government-canada/information-disposition/guidelines-information-management/Pages/guidelines-file-formats-enduring-value.aspx

Library and Archives Canada. (n.d.). File format guidelines for preservation and long-term access version 1.0. https://www.councilofnsarchives.ca/sites/default/files/LAC%20File%20Format%20Guidelines%20for%20Preservation%20and%20Long-term%20v1_2010-12_0.pdf

Other Resources

Bibliothèque nationale de France. (n.d.). Fiches formats. https://github.com/hackathonBnF/FichesFormat/wiki

Caplan, P. (2008). What is digital preservation? Library Technology Reports, 58(2). https://journals.ala.org/index.php/ltr/article/view/4224/4809/

Caplan, P. (Ed.). (2010). Digital preservation [Special issue]. Information Standards Quarterly, 22(2). https://www.niso.org/sites/default/files/2019-07/ISQ%20Spring%202010.pdf

Centre de coordination pour l’archivage à long terme de documents électroniques. (n.d.). Catalogue des formats de fichiers pour l’archivage. https://kost-ceco.ch/cms/kad_main_fr.html

Dappert, A. (2016). Digital preservation metadata and improvements to PREMIS in version 3.0 [PowerPoint Presentation]. https://www.loc.gov/standards/premis/v3/tutorialslides.pdf

Digital Preservation Coalition. (2015). Digital preservation handbook (2nd ed.). https://www.dpconline.org/handbook

Digital Preservation Coalition. (n.d.). Technology watch publications. https://www.dpconline.org/digipres/discover-good-practice/tech-watch-reports

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving audio. http://doi.org/10.7207/twgn21-11

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving databases. http://doi.org/10.7207/twgn21-06

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving documents. http://doi.org/10.7207/twgn21-07

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving GIS. http://doi.org/10.7207/twgn21-16

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving moving images. http://doi.org/10.7207/twgn21-12

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving raster images. http://doi.org/10.7207/twgn21-13

Digital Preservation Coalition, & Artefactual Systems. (2021). Preserving spreadsheets. http://doi.org/10.7207/twgn21-09

Federal Agencies Digital Guidelines Initiative. (n.d.). Guidelines, file format comparison projects. https://www.digitizationguidelines.gov/guidelines/File_format_compare.html

Federal Records Management. (n.d.). Appendix A: Tables of file formats. National Archives and Records Administration. https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html

Library of Congress. (n.d.). Recommended formats statement. https://www.loc.gov/preservation/resources/rfs/

Loftus, C. (2019, August 23). File format identification: A student project at the University of Sheffield Library. Digital Preservation Coalition. https://www.dpconline.org/blog/file-format-identification-sheffi-uni

McLellan, E. P. (2007). General study 11 final report: Selecting digital file formats for long-term preservation. InterPARES 2 project. http://www.interpares.org/display_file.cfm?doc=ip2_file_formats(complete).pdf

UK Data Service. (n.d.). Recommended formats. https://ukdataservice.ac.uk/learning-hub/research-data-management/format-your-data/recommended-formats/

Vitam. (2020). Identification des formats de fichier. https://www.programmevitam.fr/ressources/DocCourante/autres/fonctionnel/20200131_NP_Vitam_preservation-identification-format-v2.0.pdf

Games on Formats

Archives & Records Association. (2022). File format or fake? https://www.exploreyourarchive.org/archives/digital-preservation/

Fortin, É., & Ruest, J.-F. (2022). Mille formats. Bibliothèque de l’Université Laval. https://www5.bibl.ulaval.ca/formations/tutoriels-en-ligne/autres-tutoriels/mille-formats

Metadata

Baca, M. (Ed.). (2016). Introduction to metadata (3rd ed.). Getty Publications. http://www.getty.edu/publications/intrometadata/

Bascik, T., Boisvert, P., Cooper, A., Gagnon, M., Goodwin, M., Huck, J., Leahey, A., Stathis, K., & Steeleworthy, M. (2021). Dataverse north metadata best practices guide v 3.0 (Version 3). Zenodo. https://zenodo.org/record/5668945

Bibliothèque Université Laval. (n.d.). RVMFAST. https://rvmweb.bibl.ulaval.ca/rvmfast/rechercheSimple.do

Canning, E., Brown, S., Roger, S., & Martin, K. (2022). The power to structure: Making meaning from metadata through ontologies. KULA: Knowledge Creation, Dissemination, and Preservation Studies, 6(3). https://doi.org/10.18357/kula.169

Digital Research Alliance of Canada. (2021). RDM and metadata for discovery: What’s in it for researchers? [Video]. YouTube. https://youtu.be/4fjPBSKMPlw

DoRANum. (n.d.). Métadonnées, standards, formats : comment décrire les données? https://doranum.fr/metadonnees-standards-formats/

Dublin Core. https://www.dublincore.org/

ERIC. https://eric.ed.gov/

Guenther, R. (2017). Metadata for digitization and preservation. Part 1: Metadata schemes [PowerPoint Presentation]. Lyrasis.

Lacroix, C. (2017). Meilleures pratiques de gestion des métadonnées décrivant les données de recherches [Webinar]. Bureau de Coopération Interuniversitaire. https://libguides.biblios.bci-qc.ca/ld.php?content_id=36275448

OCLC FAST. https://fast.oclc.org/

ORCiD. https://orcid.org/

Research Data Management Service Group. (n.d.). Guide to writing “readme” style metadata. Cornell University. https://data.research.cornell.edu/content/readme

Supporting public procurement in Europe – 4 RDA Recommendations for open data sharing now published as ICT Technical specifications. (2017, July 24). RDA. https://www.rd-alliance.org/node/57123

UK Data Archive. (n.d.). Standards and procedures. https://www.data-archive.ac.uk/managing-data/standards-and-procedures/

WORMS: World Register of Marine Species. https://www.marinespecies.org/

About the author

Émilie Fortin has been Research Data Management and Digital Preservation Librarian at Université Laval since 2021. Prior to this, she was the librarian responsible for digital production, preservation and conservation of collections. She completed her Master’s degree in Information Science at Université de Montréal, spending a year at the Haute école de gestion in Geneva. She is involved in the Digital Research Alliance’s Preservation Expert Group as well as the Partenariat des bibliothèques universitaires du Québec (PBUQ) working group on research data management, and is also a regular participant in iPRES conferences on digital preservation. ORCID: 0000-0002-9717-6840

License

Research Data Management in the Canadian Context Copyright © 2023 by Émilie Fortin is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.

Digital Object Identifier (DOI)

https://doi.org/10.5206/QSVH4764
