"

Section 4 – Data Deposit, Sharing, and Archiving

Reviewing and taking care of community research data

Mikala Narlock; Lucia Costanzo; and Amber Gallant

Now that you’ve decided who should get access to the data and how, you can consider reviewing and taking care of the data itself. This toolkit is intended for curators of community data, whether they are community members, researchers, or part of an organization. Curators collect and manage data over the course of a project, including preparing it for deposit in a repository. Whether you are an individual collecting data alone or part of a large organization, the guidance contained in this tool can help you.

Librarian bends over a book
SilesianSoul, A librarian from Silesian University of Medicine Main Library, CC-BY 4.0, https://commons.wikimedia.org/wiki/File:Librarian_from_Circulation_Departament_in_Library_of_Medical_University_of_Silesia_in_Katowice.jpg

 

This tool can be adapted to projects of any size, from small ones to larger ones with multiple people involved. While the questions won’t change, your answers and approach may be quite different.

Small Org: With only 3–5 staff, you should think about which staff will be responsible for reviewing data and preparing it for deposit. Small groups may prefer to have the participating researcher’s university or research assistants take responsibility for review and preparation, or to deposit the data in the university’s established data repository. In this case, the org, their community, and the participating researcher should still discuss what information needs to be removed from the data to keep the community and its individuals from harm.

Large Org: Larger organizations with funding to undertake dedicated curation activities might want to consider hiring a curator to review and take care of the collected data, or sharing this cost with a partner organization. This curator might be a trusted colleague from a partner organization who can be contracted, or someone from within the community with curation experience. This approach can be advantageous when the curator is working with sensitive information; their connections to the community or prior experience with the community give them an understanding of what should be removed and what should be kept.

No Org: If a researcher is conducting community-based research with a more ad hoc group, consider depositing with a repository where there is support for curation activities – for example, at an academic library or a non-commercial repository like Dataverse. The researcher might also choose to work with a third-party stakeholder group identified by the community themselves to support curation. There are many freely available online resources on how to curate data for sharing and storage that you can draw upon for guidance.

Beginning to curate community data

The first step in curating community data is to check what you’ve received/created. At this stage, you are taking stock of the dataset to inform later curation actions. This initial review is an essential step that should not be rushed. You may also find it useful to record your findings in a separate document or digital notes to return to after the initial review.
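If the data is stored as files in a folder, taking stock can start with a simple inventory. The sketch below is one possible approach rather than a required step: it adds up the total size of the files and counts how many files there are of each format. The function name take_inventory and the folder name used afterwards are placeholders chosen for illustration.

```python
from collections import Counter
from pathlib import Path


def take_inventory(folder):
    """Return (total_bytes, counts_by_extension) for every file under folder."""
    total_bytes = 0
    formats = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            total_bytes += path.stat().st_size
            # Files with no extension are grouped under "(none)".
            formats[path.suffix.lower() or "(none)"] += 1
    return total_bytes, formats
```

For example, take_inventory("project_data") would return the total size in bytes and a count per file extension for a folder named project_data. You could copy these figures into your notes from the initial review.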

You can also consider collaborating with a trusted colleague or a local librarian in this process. Sometimes it is difficult to fully see work we are close to, so having a second opinion can be useful for catching errors we did not see previously. If the data are sensitive, check in with the community first about who can review and help with curation.

Below, you’ll find a few of the questions you and your team can ask yourselves as you first begin to curate your data, an explanation of their importance to the purpose of your project, and some examples for how you might answer them. This table should be filled out in consultation with community partners where appropriate, as the next section of this document will cover.

Question: How much data do I have?
Purpose: Determining how much data you have will influence where you can store it. If you have a lot of data, you may need to pay for storage.
Example: File types range in size and in how much storage they require. Text files are usually only a few kilobytes or megabytes, whereas large audio or video files can run to gigabytes or even terabytes.

Question: What file formats are the data currently stored in?
Purpose: Some formats can be opened with a variety of programs (non-proprietary), while others depend on one program that requires a subscription or might one day not exist. Knowing which formats you have tells you which files you might need to convert. Prior to converting any files, be sure to create a working copy of the data in case anything goes wrong during the transformation process!
Example: You might have a file in the format .doc or .docx, a format tied to Microsoft Word. You can export it to a non-proprietary format, like .txt, that can be opened with a variety of text programs.

Question: Is the dataset complete? Are any data files missing? Are there any extra copies that we don’t need? Do you have all files needed to trace decisions or analyses back to the original data collected? Are there certain files you need to collect from their owner or steward to have a complete set?
Purpose: Answering these questions will ensure that you have a complete set of files for future data reuse.
Example: You might have a de-identified set of survey answers but determine that you do not have the file that contains definitions for the variables included in that survey. You can ask for this file from the person who might have it.

Question: What documentation is currently present about how the data were collected, what different codes or acronyms were used, or how the data were processed?
Purpose: You’ll need documentation about how the data was collected and organized to understand how it was used and how it might be used again, either by you or by another person, organization, or institution seeking to re-use it.
Example: Your data files might include a README file with a description of how the data was collected, or a document containing variable definitions that match the variables used in a survey.

Question: Are there direct or indirect identifiers in the data that may need to be removed or anonymized?
Purpose: Understanding whether the data needs to be anonymized or de-identified helps you protect those who participated in its creation.
Example: If you have survey data that contains identifiers (such as name) or a combination of identifiers that is unique (such as a combination of age, occupation, and street lived on), you’ll want to make sure these identifiers are removed.

Question: If the data will be shared or stored in a data repository, what are the repository’s policies and requirements? This is an appropriate time to familiarize yourself with them.
Purpose: Understanding the policies of the place you choose to store your data will tell you how long it is stored, who to contact if you have trouble depositing or accessing it, and whether there are any restrictions on creating new versions should you need to.
Example: Let’s say you’re working with a researcher at an academic institution that has a data repository. You might choose to store the curated data in this repository. How long will they store the data? If you need to remove the data at a later point, will they support that? Before depositing, take a look at any training or policy documents you might need to make your decision.
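For the question about direct and indirect identifiers, a small script can help flag combinations of answers that point to a single person. The sketch below is a starting point only. It assumes the survey answers are stored in a CSV file with a header row, and the file and column names used afterwards (survey.csv, age, occupation, street) are hypothetical. It only lists combinations that occur once; deciding what to remove or generalize remains a conversation with the community.

```python
import csv
from collections import Counter


def find_unique_combinations(csv_path, columns):
    """Return the value combinations of `columns` that appear in only one row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # Count how many rows share each combination of values.
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    # A combination seen exactly once may identify a single participant.
    return [combo for combo, n in counts.items() if n == 1]
```

For example, find_unique_combinations("survey.csv", ["age", "occupation", "street"]) would list every age, occupation, and street combination held by exactly one respondent. An empty list does not prove the data is safe to share, since other columns or outside information may still identify people.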

Involving community partners in continued curation

After your initial review, you should begin to remedy any issues identified with the data. As part of this remediation work, you will need to collaborate with the community or research group to resolve challenges. In this collaboration, it is important to explain the “why” behind your asks. For example, instead of simply asking if there are missing files, provide context about why you’re asking and what these files may be.

The ask: Converting the data into a different format
The “why” behind the ask: By changing the format of the data, it will be usable for much longer, in a variety of programs instead of just the one it was created in.

The ask: Obtaining a README file that explains how the data was collected
The “why” behind the ask: By including this with the data, it will be much easier for any reusers to understand and honor the original intentions of the data collection.

The ask: Retrieving a missing data file from the community
The “why” behind the ask: The file may be crucial to representing the community as fully and holistically as possible. Of course, there may be reasons why you cannot have the file, such as sensitivity or community protocol, and these should be respected.

Consider which tasks need community input versus those you can accomplish independently. Use this information to identify 3–4 issues you need the community to address and take the rest as action items for yourself.

Keep in mind that the community may be using data management practices that look different from your own but serve an important purpose within the community. For example, one step in curation asks you to consider whether files need to be transformed from one file type to another. Some file formats are proprietary, meaning that they can only be opened by specific software (e.g., Microsoft Excel files), and other file formats are simply outdated (e.g., Flash video files), which means they are more likely to become unusable in the future. However, with community data, it is important that any transformation does not decrease the utility of the data. In certain instances, community members may be using a file format differently than originally intended. For example, a community group may use Excel not just for storage but also for creating charts that they regularly update and share within the community. Converting these files to another format would remove the visual elements and formatting the community needs for its reports. In some situations, it may be appropriate to retain multiple versions of the dataset so that it is useful both now and in the future.
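One way to keep the original files intact while you experiment with other formats is to do all conversion work in a separate working copy, as advised earlier in this tool. The sketch below is a minimal illustration under assumptions: the function names and the folder names used afterwards are placeholders. It copies a folder and then compares SHA-256 checksums to confirm that every file was copied without changes.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original_dir, working_dir):
    """Copy original_dir to working_dir and return files whose copies differ."""
    original_dir, working_dir = Path(original_dir), Path(working_dir)
    # copytree raises FileExistsError if working_dir already exists,
    # so an earlier working copy is never silently overwritten.
    shutil.copytree(original_dir, working_dir)
    mismatched = []
    for source in original_dir.rglob("*"):
        if source.is_file():
            copy = working_dir / source.relative_to(original_dir)
            if sha256_of(source) != sha256_of(copy):
                mismatched.append(str(source))
    return mismatched
```

For example, make_working_copy("community_data", "community_data_working") should return an empty list when the copy succeeded. Conversions can then be tried in the working folder, while the community’s original files, including any charts and formatting, stay as they were.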

Community Action Items:

  1. _________________________________________________________
  2. _________________________________________________________
  3. _________________________________________________________
  4. _________________________________________________________

Curator Action Items:

  1. _________________________________________________________
  2. _________________________________________________________
  3. _________________________________________________________
  4. _________________________________________________________
  5. _________________________________________________________
  6. _________________________________________________________

Other considerations in curating community data

When curating community data, you might also think about where the data is being published and/or preserved. This will vary based on who you want to have access to the data. The community might not want the data to be publicly available, in which case a general data repository, or even one hosted at an institution (which might encourage data that is as open as possible), might not be the right fit for the data. The community might also prefer not to make the full dataset available, but instead to create a record and have people request access by contacting the Data Access Committee. In this case, a review process might need to be developed in order to respond to access requests. These questions and processes should be part of your continued consultation with the community.

Consider also whether there should be a record of this data somewhere, even if the data are stored elsewhere. This may be important for demonstrating compliance with funder or publisher expectations (if you intend to publish a paper or report using the data, or have been funded to collect the data through a grant that requires you to do so), while protecting the community data. In this case, the data should be documented well enough that it can be understood, without inadvertently exposing the community or its members.

Access a fillable version of this resource to use in your own curation work:

 

License

Community Research Data Toolkit Copyright © 2024 by McMaster University Libraries is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.

Digital Object Identifier (DOI)

https://doi.org/10.71548/ftwp-wr49