Working with your data
➔Data Flow Model: Data Processing
Researcher questions
I have all this data, and I understand it, but how will my research assistant, community partners, or future scholars in my field understand it? How will they be able to work with my data?
Data Flow Model questions that help you think through this complexity:
- File formats?
- Number and size of files?
- Storage & backup locations?
- How are you recording metadata?
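A quick inventory can help answer the first two questions. The sketch below is a minimal example, not a prescribed tool: it assumes your project files sit under a single folder, and the function name `inventory` is invented for illustration. It counts files and total bytes per file extension.

```python
from collections import defaultdict
from pathlib import Path


def inventory(root):
    """Count files and total bytes per file extension under `root`.

    Extensions are lower-cased so ".CSV" and ".csv" are grouped together.
    """
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(no extension)"
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += path.stat().st_size
    return dict(summary)
```

Calling `inventory("project_data")` (a placeholder folder name) returns a dictionary keyed by extension, which you can paste into your README as a snapshot of the formats, number, and size of your files.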
The data processing phase of data curation management is the moment when you convert your data into a usable form that suits your research, preservation, and sharing goals. This is when you actively convert, transfer, or transform data into machine-readable formats (ideally formats that can be shared across many projects, collaborators, and platforms), and when you look for gaps or inaccuracies in the data.
You may find this phase iterative, even tedious, especially because it often runs alongside the data collecting phase: you return to your data files to process them as more information is discovered and collected at later stages of the project. We all get excited about analyzing the data we have gathered, but careful processing is what makes better analysis possible in the long run. Again, plan this step at the very start of the project and process your data as you collect it. Processing everything later can be very costly, especially in research environments that deal with large data sets. Moreover, analyzing data that was not accurately processed could slow down the work or generate misleading, unreliable, or inaccurate findings.
When processing your data, document each step. The decisions you make are important to communicate to those who use the files after you, so keep them in a README file, for the same reasons you documented and described your data collecting process. This can be understood as recording your thought process, or your sorting process (e.g., “I’ll include this and exclude that”). In a sense, you are documenting your “intellectual labour” as opposed to your “intellectual product” (e.g., a book chapter). If you are accustomed to working in archives, you can also think of this documentation as a “finding aid” that helps you communicate the provenance and order of the files according to the relationships that created them. Recording this thinking process allows others to understand and replicate your work.
For example, if you edit an “original” dataset to conform to a given research project (don’t forget to keep the original master file!), the edit must be documented. A typical case is changing the headers of census data variables, which come as long sentences in an Excel table, so that they are machine-readable in another platform. Also think about the boundaries of the research: which research files are to be shared among the partners, and which are individual research files to be managed by individual researchers? The decisions you make about processing the data will likely influence future analysis, which is another reason to document them.
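The census header example above can be sketched in a few lines. This is only an illustration under stated assumptions: the table has been exported from Excel to CSV, and the header names and short variable names in `HEADER_MAP` are hypothetical. The function writes a renamed copy, leaving the original master file untouched, and returns lines you can paste into your README to document the change.

```python
import csv
import io

# Hypothetical mapping from long census-style headers to short names.
HEADER_MAP = {
    "Total population by age group and sex": "pop_total",
    "Median total income of households": "income_median",
}


def rename_headers(source, target, header_map):
    """Copy CSV rows from `source` to `target`, renaming mapped headers.

    Returns README-ready lines recording each header that was changed.
    """
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    original = next(reader)  # first row holds the column headers
    renamed = [header_map.get(name, name) for name in original]
    writer.writerow(renamed)
    writer.writerows(reader)  # data rows are copied unchanged
    return [
        f'- "{old}" renamed to "{new}"'
        for old, new in zip(original, renamed)
        if old != new
    ]


# Usage with in-memory text; real files opened with open(..., newline="")
# work the same way.
source = io.StringIO("Geography,Total population by age group and sex\n")
target = io.StringIO()
readme_lines = rename_headers(source, target, HEADER_MAP)
```

Keeping the mapping in one place means the same dictionary serves as both the processing step and its documentation.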
It is also at this point that you start processing each file’s metadata, that is, actively adding the metadata to each file. Here again, record your metadata structures and decisions in the README file.
To take SpokenWeb (UBC Okanagan) as an example, the processing of analogue audio involves the creation of three levels of digital files:
- Master (.WAV),
- Master Access (.WAV) and
- Access (usually .MP3, derived from the Master Access).
The Master File captures all audio data to the standard of 96 kHz / 24-bit and may, depending on the nature of the event recorded, include “background” noise, such as chairs scraping the floor or dishes clinking. If an Access File (derived from the Master Access File) is cleaned and the background noise is removed, auditory information is scrubbed from the record, and these edits must be documented so users understand which version they are working with. For some researchers, this background noise might not be important, but for others it will constitute the research focus.
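One way to confirm that a Master File meets the stated standard is to read its header. The sketch below uses Python's standard-library `wave` module and is not SpokenWeb's own workflow; the function names are invented for illustration. Note that `wave` reads uncompressed PCM files only, so some WAV variants may raise `wave.Error` instead of returning a result.

```python
import wave


def wav_spec(path):
    """Return (sample_rate_hz, bit_depth) for an uncompressed PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        # getsampwidth() reports bytes per sample, so 3 bytes means 24-bit.
        return wav.getframerate(), wav.getsampwidth() * 8


def meets_master_spec(path, rate_hz=96_000, bit_depth=24):
    """True if the file matches the 96 kHz / 24-bit standard named above."""
    return wav_spec(path) == (rate_hz, bit_depth)
```

Running `meets_master_spec` on each Master File and noting the result in the README gives later users a simple record of which files were checked against the standard.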