Glossary

Edited by Kristi Thompson; Elizabeth Hill; Emily Carlisle-Johnston; Danielle Dennie; Émilie Fortin

k-anonymity: a mathematical approach to demonstrating that a dataset has been anonymized.
l-diversity: one of many privacy-protecting risk assessments based on k-anonymity but more restrictive.
active storage: a storage tier that supports data during the active phase of a research project, while data are being created, modified, or accessed frequently.
administrative: data collected as a part of the process of administering something. Administrative data is used to track people, purchases, registrations, prices, etc.
anonymization keys: documents used by qualitative researchers to de-identify their data in a systematic way. They connect information that is removed from original data (e.g., the name of an individual in an interview transcript) and replaced with more generic text (e.g., Person 6). The researcher then works with the anonymized transcript but can use the key to re-identify individuals, places, organizations, etc., if such information becomes important again during analysis. Anonymization keys must be password protected, stored securely, and never kept alongside the data in question. They are often destroyed upon completion of a study.
application program interfaces (APIs): a set of functions and procedures provided by one software library or web service through which another application can communicate with it.
Archival Information Packages (AIPs): an Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS (OAIS term). (Digital Preservation Handbook, n.d.).
archival storage: a storage tier that supports the series of managed activities needed to support long-term preservation of digital materials.
arguments: the values or variables that are provided to the function.
article processing charges: a publication fee charged to authors or their institutions for making their work open access.
ASCII: the American Standard Code for Information Interchange (ASCII) is a computer standard for character encoding. It contains 128 codes representing Arabic numerals from 0 to 9, the 26 letters of the Latin alphabet in lower and upper case, as well as mathematical and punctuation symbols.
audit trails: documentation that tracks activity and decision making throughout the life of a project, detailing what took place, when, and why.
backwards compatibility: backwards compatibility means that software can run on older hardware, or can read files created by an older version of the same software.
base map: an underlying or reference map that sits underneath the data to give it context. For example, a map showing demographic information for particular census areas is hard to read without something to indicate where those abstract census area shapes are. Although a map is itself an abstract representation, it is one that people learn to read, so the base map supplies the positional information needed to situate the data displayed on top of it.
Biobanks: a repository that stores physical biological samples and biological data.
bit sequences: a precise sequence of bits (0 or 1) which, taken together, have a specific meaning. For example, they can represent a character, an operation to be performed (machine instruction), a color selection, a digital object, etc.
bit-level preservation: a level of preservation that commits to the preservation of the ordered ones and zeroes that comprise a digital object, but which does not necessarily address the understandability of the encoded data.
boxplot: also known as a box-and-whisker plot, a boxplot is a graphical representation of a dataset that displays the distribution of the data and any potential outliers.
camel case: writing text with no spaces or punctuation while using capital letters to distinguish between words.
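Example (camel case): a minimal sketch of converting a descriptive label into a camel case name, for instance when naming variables or files. The function name `to_camel_case` and the sample labels are hypothetical, and the sketch assumes the common "lower camel case" convention in which the first word stays lowercase.

```python
def to_camel_case(text: str) -> str:
    """Join words without spaces or punctuation, capitalizing every word after the first."""
    # Treat underscores and hyphens as word separators, like spaces.
    words = text.replace("_", " ").replace("-", " ").split()
    if not words:
        return ""
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])


print(to_camel_case("participant birth year"))  # participantBirthYear
print(to_camel_case("date_of-birth"))           # dateOfBirth
```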
categorical variables: a type of data that represent discrete categories. Ordinal categorical data are those that can be ordered or ranked sequentially. Examples include course letter grades (i.e. A, B, C, D, F) and Likert scales (5-point scale to measure latent constructs or phenomena that cannot be observed directly). There are also nominal categorical variables, which cannot be ordered on a scale or in a sequence. These can be dummy-coded and included in a quantitative analysis. Examples of non-scalar categorical variables include gender, race, ethnicity, cities, etc.
checksums: unique numeric or alphanumeric strings of varying potential lengths produced by checksum-generating algorithms, like CRC, MD5, SHA1, and SHA256, based on the contents of a file.
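Example (checksums): a minimal sketch using Python's standard `hashlib` module to show how a checksum changes when file contents change. The helper name `checksum` and the sample byte strings are assumptions made for illustration; a real workflow would read the bytes from files on disk.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal checksum of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


# Hypothetical file contents: an original, a faithful copy, and a damaged copy.
original = b"id,age\n1,34\n2,51\n"
copy_ok = bytes(original)
copy_damaged = original.replace(b"51", b"57")

print(checksum(original) == checksum(copy_ok))       # True: identical content
print(checksum(original) == checksum(copy_damaged))  # False: content was altered
```

Comparing a stored checksum against a freshly computed one in this way is the usual basis of the fixity checks described elsewhere in this glossary.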
cloud-based: a computational system that is distributed over more than 2 servers in more than 2 locations allowing for remote access via web browsers or APIs to compute power and/or data storage.
codebook: a document that describes a dataset, including details about its contents and design.
coding literacy: learning computer code has been compared to learning a new language. Coding literacy is the ability to comprehend computer code, much like mathematical literacy is the ability to comprehend math.
command-line tool: a computer program that can be run from the command line interface (CLI) of an operating system. The CLI is a text-based interface that allows the user to interact with the computer using typed commands, instead of using a graphical user interface (GUI) with menus and icons.
computational research: research that relies on computers for data creation and/or analysis.
CONTENTdm: an OCLC tool for managing and presenting digital content. See https://www.oclc.org/en/contentdm.html for more information.
controlled vocabularies: a list of standardized terminology, words, or phrases, used for indexing or content analysis and information retrieval, usually in a defined information domain (CODATA Research Data Management Terminology, CC BY 4.0).
CSV data file: a delimited text file that uses a comma to separate values within a data record.
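Example (delimited text files): a minimal sketch using Python's standard `csv` module to read the same records from comma-delimited and tab-delimited text. The sample records and the helper name `read_records` are hypothetical. Note that a value containing a comma must be quoted in the comma-delimited version.

```python
import csv
import io

# The same two records, once comma-delimited and once tab-delimited.
comma_text = 'id,city,age\n1,Halifax,34\n2,"Hamilton, ON",51\n'
tab_text = "id\tcity\tage\n1\tHalifax\t34\n2\tHamilton, ON\t51\n"


def read_records(text: str, delimiter: str) -> list:
    """Parse delimited text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_records(comma_text, ",")
tab_rows = read_records(tab_text, "\t")

print(comma_rows[1]["city"])   # Hamilton, ON
print(comma_rows == tab_rows)  # True: same records, different delimiter
```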
Data Access Committee (DAC): an independent decision-making body whose purpose is to oversee access to datasets for research purposes.
data cleaning: the process of employing six core activities: discovering, structuring, cleaning, enriching, validating, and publishing data.
data dictionary: a machine-readable and often machine-actionable document, similar to a codebook, that generally contains detailed information about the technical structure of a dataset in addition to its contents.
Data Documentation Initiative (DDI): a standards-based metadata schema developed for social science data.
Data Management Plan (DMP): a formal description of what a researcher plans to do with their data from collection to eventual disposal or deletion.
data objects: for the purpose of the FAIR guiding principles, data object is defined as an Identifiable Data Item with Data elements + Metadata + an Identifier.
data packaging: the process of grouping data and information about data into a logical whole for use in a digital preservation process.
data stewards: while their role can vary, data stewards in a research context are individuals tasked with ensuring data are handled systematically and uniformly.
data twins: records in a dataset that have the same values on a set of indirect identifier variables.
de-identification: the process of removing from a dataset any information that might put research subjects’ privacy at risk.
delimiters: special characters reserved by computational systems or languages to denote independent objects or elements.
dependency: an additional software library that can be downloaded from the internet and used for specific programmatic tasks.
descriptive design: a type of study design concerned with exploratory questions (e.g. what? when? how? where?), which aims at exploring a phenomenon or observation to describe an effect.
Designated Community: a conceptual entity introduced by OAIS, representing potential users of a digital object being preserved by an archive. Designated Community is a crucial concept in long-term preservation planning because understanding the needs and capabilities of the Designated Community allows for informed decision-making regarding things like choices of file formats and retention of data.
digital humanities: an academic field concerned with the application of computational tools and methods to traditional humanities disciplines such as literature, history, and philosophy.
digital materials: any piece of information, either singular or in assemblage, that is stored by computers. They are called digital because all computer-readable versions of data are ultimately encoded as a series of ones and zeroes, which are the only inputs computing systems can understand.
Digital Object Identifier (DOI): a name (not a location) for an entity on digital networks. A DOI provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI is a type of Persistent Identifier (PID) issued by the International DOI Foundation. This permanent identifier is associated with a digital object that permits it to be referenced reliably even if its location and metadata undergo change over time (CODATA Research Data Management Terminology, CC BY 4.0).
digital preservation: the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.
digital signatures: the equivalent of a handwritten signature on paper which offers guarantees on the authenticity of the identity of the signatory.
direct identifiers: information collected by the researcher that can uniquely identify human subjects, and include things like names, phone numbers, social insurance numbers, student numbers, and so on.
DMP Assistant: a web-based tool which asks users a series of questions about their data and research plans, with contextual help and guidance on how to answer those questions.
Dublin Core: simple and generic metadata schema that uses 15 optional and repeatable core elements like title, creator, format, and date. Created in 1995, Dublin Core is also an international standard (ISO 15836).
dummy variable: a dummy variable is a text or non-quantitative variable that is assigned a number for the purpose of quantitative analyses. For example, a dataset might include a variable for gender with options such as female coded as 1, male coded as 2, non-binary coded as 3, and prefer not to respond coded as 4.
electronic lab notebook: an online tool built on the design and use of paper lab notebooks.
emulation: a means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers (Digital Preservation Handbook, n.d.).
equivalence class: a set of records in a dataset that has the same values on all quasi-identifiers.
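Example (equivalence classes): a minimal sketch that groups records by their quasi-identifiers and reports the size of each equivalence class. The records, the variable names, and the choice of `age_group` and `region` as quasi-identifiers are all hypothetical. The sketch assumes the common reading of k-anonymity in which a dataset is k-anonymous when every equivalence class holds at least k records, so k is the size of the smallest class.

```python
from collections import Counter

# Hypothetical records; "diagnosis" is the sensitive variable.
records = [
    {"age_group": "30-39", "region": "East", "diagnosis": "asthma"},
    {"age_group": "30-39", "region": "East", "diagnosis": "diabetes"},
    {"age_group": "30-39", "region": "East", "diagnosis": "asthma"},
    {"age_group": "40-49", "region": "West", "diagnosis": "asthma"},
    {"age_group": "40-49", "region": "West", "diagnosis": "asthma"},
    {"age_group": "50-59", "region": "West", "diagnosis": "diabetes"},
]
quasi_identifiers = ("age_group", "region")


def equivalence_class_sizes(rows, keys):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


sizes = equivalence_class_sizes(records, quasi_identifiers)
k = min(sizes.values())  # size of the smallest equivalence class
sample_uniques = [combo for combo, count in sizes.items() if count == 1]

print(k)               # 1
print(sample_uniques)  # [('50-59', 'West')]
```

Here the single record in the ("50-59", "West") class is a sample unique, so the dataset as a whole only reaches k = 1.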
ethics approval: authorization to carry out a research study that’s granted by bodies variously referred to as: Ethics Review Boards, Research Ethics Boards, Research Ethics Committees, or Institutional Review Boards.
evidence-based data: evidence-based data comes in a variety of forms and is the result of some form of research activity, including data analysis, modeling, literature syntheses, and evaluations that produce guidelines and assessments of the implementation of a process or technology and its cost-effectiveness.
explanatory design: a type of study design concerned with causal relationships (i.e. causes and their effects, or questions concerning the "why" of an effect), which aims at explaining a phenomenon or observation in order to understand an effect.
Exploratory Data Analysis: a process used to explore, analyze, and summarize datasets through quantitative and graphical methods. Exploratory data analysis (EDA) makes it easier to find patterns and discover irregularities and inconsistencies in the dataset.
FAIR: Findable, Accessible, Interoperable, Reusable.
FAIR principles: guiding principles to ensure that machines and humans can easily discover, access, interoperate, and properly reuse information. They ensure that information is findable, accessible, interoperable, and reusable.
file extensions: suffix assigned to a file to identify it. For example, a file created with Word software will have the extension DOCX.
file format: a standardized method of arranging ones and zeroes that can be used to encode specific types of information.
fixity: a concept relating to the permanence of digital objects. Establishing consistency in digital objects can be tricky, as the way they are stored means that objects are often copied or transmitted frequently, raising questions as to whether the resulting object is the “same” as the object before copying/transfer. In common practice, fixity is closely tied to the generation and verification of checksums, which can help ensure that an ordered series of bits have remained unchanged.
fork: in GitHub, a copy of a repository (such as one containing a dataset) that retains a link to the original creators.
format obsolescence: a threat to the longevity of digital objects based on an inability to decode the bitstream comprising the digital object. Format obsolescence threats are often addressed through a program of file format identification, validation, and – if necessary – normalization/migration.
global data reduction: making changes to variables across datasets, such as grouping responses into categories.
histogram: a graphical representation of the distribution of a set of continuous or discrete data.
homogeneity attack: a method of violating the confidentiality of a group of research subjects that can happen when everyone with a particular set of demographic characteristics also has a particular sensitive characteristic.
identifying information: any information in a dataset that in combination could lead to disclosing the identity of an individual.
Indigenous data sovereignty: the right of Indigenous Peoples to collect, access, analyze, interpret, manage, distribute, and reuse all data that was derived from or relates to their communities.
indirect identifiers: also known as quasi-identifiers, these are characteristics of people that do not uniquely identify individuals on their own but may, in combination, serve to reveal someone’s identity. A characteristic should only be considered quasi-identifying if an attacker could plausibly match that characteristic to information in an external source.
integrated development environment (IDE): a software application that provides a comprehensive environment for software development. RStudio is an integrated development environment (IDE) that enables users to write, debug, run R code and display the corresponding outputs.
integration: the process of connecting different, often disparate systems or tools into a cohesive infrastructure.
integrity checking: see the glossary entry for fixity.
interoperability: the ability of data or tools from non-cooperating resources to work with or communicate with each other with minimal effort using a common language.
interoperable: interoperability requires that data and metadata use formalized, accessible, and widely used formats. For example, when saving tabular data, it is recommended to use a .csv file over a proprietary file such as .xlsx (Excel). A .csv file can be opened and read by more programs than an .xlsx file.
interval measurement scale: an interval measurement scale refers to numbers that are equally distanced from each other in ascending or descending order and where zero may be a point on the scale (i.e. zero does not mean the absence of a value). Examples include temperature and time. In the case of the Celsius temperature scale, zero refers to the point at which water freezes, but not the absence of temperature.
iterative: an iterative approach to research is one in which ongoing review and adjustment are embedded into the research process. As a result, a study design may be further adapted based on what is learned as data are collected and analyzed.
knowledge mining: collecting Indigenous knowledge without seeking permission or consulting stakeholders in the community.
knowledge theft: collecting Indigenous knowledge without seeking permission or consulting stakeholders in the community.
Law 25: a Quebec law formally titled An Act to modernize legislative provisions as regards the protection of personal information.
layers: the visual representation of a geographic dataset in any digital map environment. Conceptually, a layer is a slice or stratum of the geographic reality in a particular area, and is more or less equivalent to a legend item on a paper map. On a road map, for example, roads, national parks, political boundaries, and rivers might be considered different layers (ESRI, n.d.).
Likert scale: a Likert item is a question on a survey which asks respondents to select a response to indicate how much they agree or disagree with a statement. A Likert scale is developed by adding up or averaging a number of related Likert items.
literate programming: where code, commentary, and output display together in a linear fashion, much like a piece of literature.
local suppression: deleting individual cases or responses.
longitudinal design: a type of study concerned with the effect of time on an outcome. In other words, a study that measures an outcome at more than one point in time. For example, a longitudinal survey design involves repeating the same survey on the same individuals over time to understand changes in attitudes or behaviors.
loss of provenance: a threat to the longevity of digital objects based on members of the user community being unable to discern important information about the digital object, such as its source, its history of changes, and ultimately its authenticity. Threats to the provenance of a digital object are often addressed through the careful creation and maintenance of preservation metadata.
lossless compression: file size reduction mechanism that preserves all original data.
machine-readable metadata: metadata in a form that can be used and understood by a computer.
MAMIC: Maturity Assessment Model in Canada. A Canadian-specific RDM assessment tool designed to help evaluate the current state of institutional RDM services and supports as part of an institutional RDM strategy development process. It focuses on four areas of service and support — Institutional Policies and Processes, IT Infrastructure, Support Services, and Financial Support — and allows users to assess the maturity and scale of these services.
maturity assessment models: tools used to evaluate the level of sophistication of a service or product. These models measure the level of attainment in relevant capability areas using a scale (e.g., 0-4 or 1-3), which allows users to quantify capabilities and enable continuous process improvement.
Maturity Level: in the MAMIC, a measure of how complete a particular element is in relation to RDM. The lower the level, the less developed (mature) the element is.
media degradation: a threat to the longevity of digital objects based on the decay of the carrier medium upon which they are stored. Sometimes called “bit rot.” Media degradation threats are often addressed by preservation actions that ensure bit-level integrity, including the active monitoring of digital objects to detect corruption/loss, and by maintaining multiple copies of an object on different pieces/types of media.
media obsolescence: a threat to the longevity of digital objects based on the notion that the media upon which they are stored may no longer be usable because a user would not have the correct hardware (or software like drivers) to access the data on the media. At the time of this writing, media obsolescence is commonly associated with floppy disks or various data cartridge formats that have fallen out of common use over time. Media obsolescence threats are often addressed by bit-level integrity methods, including the migration of digital objects to newer, more modern carriers on a regular basis.
metadata: data about data; data that define and describe the characteristics of other data.
metadata schemas: a grouping of elements intended to describe a resource. For each element, the name and the semantics (the meaning of the element) are specified. Content rules (how content should be phrased), representation rules (e.g., capitalization rules), and allowed element values (e.g., from a controlled vocabulary) may be optionally specified, but this is not always the case.
modèle d’évaluation de la maturité de la GDR au Canada: the French translation of the Maturity Assessment Model in Canada (MAMIC). See the MAMIC glossary entry for more.
multifactor authentication: multi-factor authentication requires two things: a password and a device. When you use your password to sign into a service, your login prompts a request for a one-time code generated by a device such as a cellphone or a computer. One-time codes may be delivered by text message or email, or they may be generated on your device via an authentication app like Google Authenticator. Many banks and government organizations, such as Canada Revenue Agency, now require users to enable two-factor authentication.
non-proprietary: not owned by a company.
normalization: process of converting copies of original files to one of a small number of non-proprietary, widely-used, and preservation-friendly formats during ingest. Normalization standardizes ingested material into a subset of formats stored by an archives, and allows the archives to avoid managing a large number of formats into the future. However, normalization can also alter file sizes and properties. Archives should assess normalization priorities and approaches through researching and defining file format policies (Scholars Portal, n.d.).
OAIS: (ISO 14721) the Open Archival Information System. First published as an ISO standard in 2003 and revised in 2012, OAIS defines a set of requirements for an information system meant to maintain the usability of digital objects over time.
oblique photos: aerial photograph taken with the axis of the camera held at an angle between the horizontal plane of the ground and the vertical plane perpendicular to the ground. A low oblique image shows only the surface of the earth; a high oblique image includes the horizon (ESRI, n.d.).
OCAP®: an acronym for ownership, control, access, and possession. These four principles govern how First Nations data and information should be collected, protected, used, and shared. OCAP® was created because Western laws do not recognize the community rights of Indigenous Peoples to control their information.
open access: the free, immediate, online availability of information coupled with the rights to use this information fully in the digital environment.
open data: online, free of cost, accessible data that can be used, reused, and distributed provided that the data source is attributed.
open format: the format’s technical specifications are public; the information that helps to understand its operation and its structure are accessible.
open science: the movement to make scientific research, data, and dissemination transparent and widely accessible without barriers, financial or otherwise.
open source: when software is open source, users are permitted to inspect, use, modify, improve, and redistribute the underlying code. Many programmers use the MIT License when publishing their code, which includes the requirement that all subsequent iterations of the software include the MIT license as well.
OpenRefine: an open source data manipulation tool that cleans, reshapes, and batch edits messy and unstructured data.
operationalize: operationalizing variables means creating quantitatively measurable definitions of abstract concepts or constructs that cannot be measured directly.
ORCiD: unique identifier for members of the research community, defined by a permanent numeric code with two main functions: to link the person to their research activities, including their publications, and to distinguish them from others.
outliers: data points which dramatically differ from others in the dataset and can cause problems with certain types of data models and analysis.
p-sensitive k-anonymity: one of many privacy-protecting risk assessments based on k-anonymity but more restrictive.
password manager: a computer program that stores passwords. Some password managers also create and suggest complex passwords for use.
peer debriefing: the process of study team members questioning one another about what they have seen and heard. Such discussions are themselves sometimes included in a study’s final dataset.
persistent identifier (PID): a long-lasting reference to a digital object that gives information about that object regardless of what happens to it. Developed to address “link rot,” a persistent identifier can be resolved to provide an appropriate representation of an object whether that object changes its online location or goes offline (CODATA, CC BY 4.0).
population unique: a person in a population who may be identifiable because of some unique combination of demographic characteristics.
pre-prints: preliminary version of an article that has not undergone a formal peer-review process, but may be shared for comment. Pre-prints may be considered as grey literature.
PREMIS metadata standard: a metadata standard and data dictionary developed to standardize the way that preservation systems record and understand important concepts in the long-term preservation of a digital object. PREMIS files can include technical information (e.g., file format information, checksums) as well as provenance information (e.g., changelogs, acquisitions information).
provenance: a record of the source, history, and ownership of an artifact, though in this case the artifact is computational.
qualitative data: data generated by research examining social aspects of the human condition using descriptive methods rather than measurement.
quartiles: the values that divide a list of numbers into quarters.
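Example (quartiles): a minimal sketch using `statistics.quantiles` from Python's standard library (Python 3.8 or later) to find the three cut points that divide a list into quarters. The sample values are hypothetical. Statistical packages use several slightly different interpolation methods, so the first and third quartiles reported by other tools may differ a little from these.

```python
import statistics

# Hypothetical measurements.
values = [2, 4, 4, 5, 7, 9, 10, 12]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(q1, q2, q3)
print(q2 == statistics.median(values))  # True: the second quartile is the median
```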
R object: a data structure that contains a set of values of a particular type. R objects can be created, modified, and used to perform computations and analyses.
raster data: data that represents spaces as a regular grid or series of cells, each with a particular value – often thought of as the pixels of an image. For example, a scanned historical map or an air photo.
ratio scale: a ratio numerical scale may increase or decrease according to a denominator rather than equal distances. On a ratio measurement scale, zero is not a point on the scale, but rather, means the absence of a value. Population density is an example of a ratio measure. In the case of population density, zero refers to a place with no human inhabitants.
RDM maturity assessment: an evaluation of the current state of RDM services and supports, usually at a specific institution.
RDM policies: higher level plans outlining generalized courses of action for RDM (e.g., Tri-Agency Research Data Management Policy).
RDM practices: specific enactment of RDM or support services (e.g., University of Alberta RDM; McMaster University RDM Services).
RDM principles: top level values or concepts intended to guide RDM overall (e.g., FAIR principles, OCAP® principles).
RDM strategies: mid-level plans intended to achieve a set of goals or priorities when managing research data (e.g., Dalhousie University Institutional RDM Strategy, University of Waterloo RDM Institutional Strategy Project).
README file: a plain text file that includes detailed information about datasets or code files. These files help users understand what is required to use and interpret the files, which means they are unique to each individual project. Cornell University has a detailed guide to writing README files that includes downloadable templates (Research Data Management Service Group, n.d.).
reflexive: reflexivity is the process by which qualitative researchers acknowledge, examine, and account for the impact their own judgments, practices, and beliefs have on data collection and analysis.
replicable research: replicable research is research which can be repeated by other researchers on new or different data, getting the same or similar results as the original researchers.
repository storage: a storage tier that supports deposit, storage, discovery, and appropriate access to authoritative copies of digital materials in a variety of formats.
reproducible research: reproducible research is research that can be repeated by researchers who were not part of the original research team using the original data and getting the same results.
research data: sources of information or evidence that have been compiled to serve as input to research.
research data lifecycle: the cycle in which data is collected, processed, analyzed, preserved, and then shared so other researchers can start the cycle anew.
Research Data Management (RDM): a term that describes all the activities that researchers perform to structure, organize, and maintain research data before, during, and after the research process.
right to be forgotten: “the data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay” (GDPR.EU, 2018).
sample unique: an individual in a dataset whose information does not match any other individual in the dataset on the indirect identifiers.
script files: text files containing a sequence of R commands that can be run one after another.
secondary analysis: research that uses data collected previously to conduct a new study.
self-determination: the right of Indigenous Peoples to determine what is best for their social, cultural, and economic development, and to carry out those decisions in a way that is best for their people. This definition is based on the United Nations Declaration on the Rights of Indigenous Peoples (UNDRIP).
sensitive data: data which cannot be shared without potentially violating the trust of or risking harm to an individual, entity, or community.
signature: a series of bytes that occur in a predictable manner at the beginning and often the end of a file.
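Example (file signatures): a minimal sketch that identifies a file's format from the bytes at its beginning rather than from its extension. The lookup table holds three well-known signatures (PNG, PDF, and ZIP); the function name `identify` and the sample byte strings are hypothetical, and format identification tools used in preservation consult much larger signature registries.

```python
# Leading byte sequences for a few common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def identify(data: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


print(identify(b"%PDF-1.7\n"))           # PDF document
print(identify(b"plain text, no magic"))  # unknown
```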
social sciences: a meta-disciplinary category encompassing scholarly disciplines that employ scientific methodologies and approaches to study social, cultural, affective, and behavioral human phenomena. Examples of social science disciplines include sociology, political science, economics, psychology, information studies, and more.
software container: like a self-contained virtual computer within a computer. It includes everything required to run a piece of software (including the operating system), without the need to download and install any programs or data.
survey piping: wording automatically inserted by survey software based on previous responses.
tab-separated values files (TSV): a delimited text file that uses a tab character to separate values within a data record.
tabular data: data arranged in the form of tables, i.e., in rows and columns.
tabular format: a format in which information is entered into a table in rows and columns.
TCPS 2: Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans. The primary harmonized framework that accounts for Canada-wide laws and broader ethical paradigms applicable to the rights of human participants in research.
the agencies: the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and the Canadian Institutes of Health Research (CIHR) (the agencies) are Canada’s three federal research funding agencies and the source of a large share of the government money available to fund research in Canada.
traceable research: traceable research is research where external researchers can understand and repeat every change made to the raw data to get it into final shape for analysis.
traditional knowledge: collective knowledge of the traditions and practices that were developed over time and used by Indigenous groups to sustain themselves and adapt to their environment. Traditional knowledge is passed from one generation to the next within Indigenous communities. Indigenous knowledge comes in many forms, including storytelling, ceremony, dance, arts, crafts, hunting, trapping, gathering, food preparation and storage, spirituality, beliefs and worldviews, and plant medicines.
Tri-Agency Research Data Management Policy: a policy applying to data collected with research funding from one of Canada's three federal funding agencies. The policy is intended to encourage better research by requiring researchers to create data management plans and preserve their data.
unicode encoding: Unicode is a character encoding standard that is not tied to any single alphabet or language. It assigns a unique code to each character, which enables the exchange of texts in different languages.
vector data: data that comprises individual points that refer to specific locations. These points can be joined to form lines or enclosed shapes (polygons). The points, lines, and polygons can each be treated as individual units with associated data.
version control: a system for automatically tracking every change to a document or file, allowing users to revert to all previously saved versions without needing to continually save copies under different file names.
versioning: also known as version control, this means keeping track of the changes that are made to a file, no matter how small. This is usually done using an automated Version Control System, such as Git (often used through a hosting service like GitHub). Many file storage services, such as Dropbox, OneDrive, and Google Drive, keep historic versions of a file every time it is saved. These versions can be accessed by browsing the file's history.

License

Research Data Management in the Canadian Context Copyright © 2023, edited by Kristi Thompson; Elizabeth Hill; Emily Carlisle-Johnston; Danielle Dennie; and Émilie Fortin, is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.