Considering Types of Data
Chapter 15: Managing Quantitative Social Science Data
Dr. Alisa Beth Rod and Dr. Biru Zhou
Learning Outcomes
By the end of this chapter you should be able to:
- Define different types of quantitative social science data.
- Describe specific ways Research Data Management practices might be implemented when working with quantitative social science data.
- Understand how good Research Data Management practices can help mitigate the reproducibility crisis and facilitate data deposit for reuse in the quantitative social sciences.
Introduction
The first step in managing quantitative research data in the social sciences is to review the typical research design and identify where Research Data Management (RDM) practices could be applied to facilitate research and bolster research outputs. Most quantitative social science research follows scientific study designs. These designs help researchers generate research questions, formulate hypotheses and concrete predictions, design the research project, collect and analyze the research data, and write up the results to communicate the findings to the public. To contextualize RDM in quantitative social science research, it is important to be aware of the process and workflow of these types of research projects. The next section will provide an overview of quantitative social science research studies as context for the remaining sections on quantitative social science data management.
Overview of Quantitative Social Science Research
There are two fundamental overarching approaches to quantitative social science research that may have implications for the collection and management of data. One approach researchers use is a descriptive design, which aims at exploring a phenomenon or observation to describe an effect (de Vaus, 2001). Common descriptive research includes studies performed by governments (e.g., household income levels, public library usage, noise complaints, traffic around cities over time, etc.). The goal of descriptive research is to describe social, economic, or political phenomena without delving into the cause of these phenomena. Research questions using descriptive designs might include:
- What is the poverty level of rural communities?
- Is the level of social inequality increasing or declining across Montreal?
- Where in Toronto are people more likely to be apprehended and convicted of crimes?
- Who is more likely to be apprehended and convicted of crimes in Alberta?
Another approach researchers may use in studying social phenomena is an explanatory design, which aims at explaining a phenomenon or observation in order to understand an effect (de Vaus, 2001). Explanatory studies are concerned with understanding the cause(s) of social, economic, and political phenomena. Explanatory studies are natural extensions of established descriptive research. For example, if a descriptive study establishes that a certain neighbourhood in a city has a significantly higher eviction rate than all other neighbourhoods, an explanatory study might investigate the reasons or causes for this discrepancy. Research questions using explanatory designs might include:
- Why is the eviction rate in “y city” highest out of all cities in Canada?
- Why are school buses significantly delayed in “z community”?
- Why is the poverty level in “x community” the highest in Manitoba?
Regardless of which approach is used for the study, the first step in the research process is to articulate a research question or a set of research questions. A research question states the purpose of the study in the form of a question. The following list includes some examples of the structure of potential research questions (with x, y, and z serving as placeholders for concepts):
- What is the relationship between x and y?
- How does the location of x affect y?
- What structural or demographic factors predict x, y, and z?
- Why does x affect y?
Here are some examples of versions of these questions incorporating real-world social concepts:
- What is the relationship between poverty and education?
- How does the location of public libraries affect community cohesiveness?
- What structural or demographic factors predict unemployment, economic insecurity, and demand for subsidized housing?
- Why does personality affect susceptibility to framing effects?
The research question will frame the subsequent steps in the design and execution of a quantitative social science study: formulating hypotheses, designing the project, collecting and analyzing the data, and communicating the results.
Quantitative Social Science Research Process
Good RDM practices are relevant to all phases in a typical quantitative social science research project, from planning to publishing research results. Data Management Plans (DMPs) are important tools to help researchers consider how to handle their research data in different phases of the research process. In the rest of this chapter, we’ll share some RDM considerations that are especially relevant when working with quantitative social science data.
Managing Quantitative Social Science Research Data: Files, Formats, and Documentation
Quantitative social science data are not inherently different from other types of quantitative data except in the source(s) and focus of the data. Quantitative data are numerical data measured on an interval or a ratio scale, or categorical variables that are dummy-coded or converted to an ordinal scale. The most common method of collecting original quantitative social science data is through survey instruments.
A survey instrument, or questionnaire, is a series of questions asked of research participants, designed to measure a concept or multiple concepts. A survey questionnaire may include items, or questions, that operationalize multiple concepts — that is, turn them from abstract concepts into quantitatively measurable variables and indicators.
In addition to survey data, many social scientists also rely on administrative data. Administrative data refer to data that are collected by organizations or by government agencies for administrative purposes (i.e., not for research purposes but to administer or assess services, products, or goods). Examples of administrative data include vital statistics (e.g., birth and mortality rates), human resources records, municipal or individual tax information, budgets, locations of public services, and recipients of social service programs. It is important to note that administrative data that are not publicly available are typically governed by licenses or contracts that may affect data sharing and/or deposit. This was discussed in more detail in chapter 13, “Sensitive Data.”
In your RDM practice, consider the licenses on datasets when planning how the dataset might be shared or deposited at the end of a project. For example, certain contracts or licenses may dictate whether the dataset you are using may be later shared during a peer review process for verification of findings or whether the dataset may later be deposited for reuse by other researchers. Recall what you learned about licenses and sharing data in chapter 12, “Planning for Open Science Workflows.”
In most cases, regardless of whether the data are derived from original surveys or administrative sources, quantitative social scientists collect and store their data in a tabular format.
Conventionally, tabular data are organized so that each row represents an observation (e.g., one research participant, one neighbourhood, one building, one year) and each column represents a variable (i.e., information that varies across observations). We’ll discuss alternative formats of tabular data (i.e., long vs. wide) in the following section.
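To make the row and column convention concrete, here is a minimal sketch (hypothetical data, using pandas) of a wide-format table reshaped into the long format mentioned above:

```python
import pandas as pd

# Hypothetical extract in "wide" format: one row per participant,
# one income column per survey year.
wide = pd.DataFrame({
    "participant_id": [101, 102, 103],
    "income_2021": [41000, 52000, 38500],
    "income_2022": [43000, 51000, 40000],
})

# Reshape to "long" format: one row per participant-year observation,
# so each row holds a single observation and each column a single variable.
long = wide.melt(id_vars="participant_id", var_name="year", value_name="income")
long["year"] = long["year"].str.replace("income_", "", regex=False).astype(int)

print(long.sort_values(["participant_id", "year"]).to_string(index=False))
```

Each layout has its uses: wide tables are convenient for data entry, while the long layout makes each row a self-contained observation.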
There are several good practices related to the set-up of a tabular dataset. One best practice is to avoid spaces in variable, file, and/or observation names, because many software tools treat a blank space as a delimiter when tasks are automated. Another good practice is to limit the length of variable names; using eight characters or fewer prevents some statistical analysis software from truncating them. Naming variables this way also improves the interoperability and reusability of the data in other software in the future.
In many cases, “cleaning” the data may be required before analyses can be performed or data can be shared or deposited, which you learned about in chapters 7 (“Data Cleaning During the Research Data Management Process”) and 8 (“Further Adventures in Data Cleaning”). When cleaning your data, you will also want to create documentation, including coded versions of variable and/or observation names and an accompanying codebook as a separate document. Spaces in file names or in table headers can cause certain software or applications to crash or can result in errors when trying to open or use a file; in a command line environment, for example, spaces are used as delimiters. To avoid blank spaces, use camel case (StartingEachWordWithACapitalLetter) or underscores (between_words) to create machine-readable names.
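As an illustrative sketch of this naming advice (a hypothetical helper, not a tool from the chapter), a short function can turn free-text headers into short, underscore-style, machine-readable variable names:

```python
import re

def to_machine_name(text: str, max_len: int = 8) -> str:
    """Convert free text to a short, underscore-style variable name.

    Replaces spaces and punctuation with underscores, lowercases the
    result, and truncates to max_len characters (eight by default, to
    avoid truncation by some statistical packages).
    """
    name = re.sub(r"[^0-9a-zA-Z]+", "_", text.strip()).strip("_").lower()
    return name[:max_len].rstrip("_")

print(to_machine_name("Travel Costs (This Semester)"))  # travel_c
```

A codebook entry would then record which full question each shortened name stands for.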
Consider a case where a researcher has conducted a survey of undergraduate students to ask about the costs associated with course materials. This survey questionnaire included the following item: “This past semester, were you enrolled in any courses that involved costs associated with travelling locally around the greater Calgary area?” It would not be useful to label a column in a spreadsheet with this question verbatim. Thus, the researcher may create a coded version, or a shorthand name, such as “TravelCosts,” to substitute as a column header, or variable name, for the full question in the dataset. To keep track of these substitutions, or codes, the best practice is to create a survey codebook in the form of a separate text document that connects the shorthand codes to the full original questions from the questionnaire.
In addition to connecting codes with full variable names or questionnaire items, a codebook can also contain information about missing data and the labels or values of the range of responses to a particular question. For example, if the possible responses to the previous question were “yes,” “no,” and “I’m not sure,” the researcher may use numeric codes with value labels to analyze a quantitative version of the responses. The codebook could contain this information by noting that “yes” is coded as 3, “no” is coded as 2, and “I’m not sure” is coded as 1.
The following table provides an example of how this example survey codebook document might look:
| Variable Code | Variable Label (Original Question) | Response Options |
|---|---|---|
| TravelCosts | This past semester, were you enrolled in any courses that involved costs associated with travelling locally around the greater Calgary area? | 3 = Yes; 2 = No; 1 = I’m not sure |
| TXTBKCosts | This past semester, were you enrolled in any courses that involved costs associated with purchasing a textbook? | 3 = Yes; 2 = No; 1 = I’m not sure |
| Concern | Have you ever expressed concern to a professor about your ability to afford the materials required for their course? | 3 = Yes; 2 = No; 1 = I’m not sure |
If there are multiple variables that have the same response options, such as “TravelCosts” and “TXTBKCosts” in the example above, it is wise to maintain consistent value labels for the response options across the variables to avoid confusion during the analysis phase of the project.
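One way to keep value labels consistent in practice is to define the code map once and apply it to every variable that shares the response options. A minimal sketch (hypothetical responses, following the 3/2/1 scheme in the example codebook):

```python
# One shared code map keeps value labels consistent across all
# variables that use the same response options.
RESPONSE_CODES = {"Yes": 3, "No": 2, "I'm not sure": 1}

responses = [
    {"TravelCosts": "Yes", "TXTBKCosts": "No"},
    {"TravelCosts": "I'm not sure", "TXTBKCosts": "Yes"},
]

# Apply the same map to every variable, so "Yes" is always 3,
# never 3 in one column and 1 in another.
coded = [
    {var: RESPONSE_CODES[answer] for var, answer in row.items()}
    for row in responses
]
print(coded)
# [{'TravelCosts': 3, 'TXTBKCosts': 2}, {'TravelCosts': 1, 'TXTBKCosts': 3}]
```

Keeping the map in one place also gives you a single definition to copy into the codebook.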
It is also common for research labs or teams to conduct multiple research projects on similar topics, using similar measures, at the same time. For instance, suppose two similar studies are being conducted simultaneously on the impact of workplace violence on employees’ post-traumatic stress disorder (PTSD) symptoms. One study might examine how workplace bullying contributes to employees’ PTSD symptoms, while the other focuses on how client-initiated physical violence does so. Because PTSD symptoms are measured in both studies, keeping consistent naming and coding conventions for the PTSD measure improves interoperability within the research team. The codebook, as part of the documentation of the dataset (which would also ideally include a README file and/or metadata), is essential when a researcher aims to share or deposit their dataset with other researchers or the public: it would be impossible to use the dataset without knowing the definitions of each variable. (For further examples, see the Inter-university Consortium for Political and Social Research’s (ICPSR) “What is a Codebook” resource, which has a concise description and more examples of typical codebook structures.)
Naming variables and files and defining quantitative versions of abstract social or behavioural constructs is complex. A key aspect of RDM in quantitative disciplines, including social science, involves determining file naming conventions and file storage hierarchies using a DMP. A DMP is an important project management tool for documenting a file naming convention, especially when working with quantitative data that may incorporate multiple versions of a dataset stored in tabular format with script/code files that may be required for cleaning or analyzing the dataset(s).
The conventions for naming files in the quantitative social sciences do not necessarily differ from other disciplines. It is necessary to incorporate enough specific information to uniquely identify a file and to understand the difference between different versions of the same dataset. For example, it can be important to include “raw” in the name of a file containing data collected prior to cleaning or analysis. Making a copy of the raw file as a working file and maintaining it as the authentic version of the data prior to any intervention is a good practice. The working copy of the data file should have a name that clearly indicates it is not the raw file and also distinguishes it from other potential versions of the dataset (e.g., a version of the dataset that has been cleaned or a version of the cleaned dataset that includes variables calculated from the raw data). Over the course of a project, many files may be created for the same dataset. A DMP can be used to plan for the types of files that may be created and name them in ways that uniquely identify each file. The ICPSR, arguably the most well-known social science repository, based out of the University of Michigan in the United States, has a sample DMP for social science that incorporates advice relevant to the type of data that quantitative social scientists collect and manage.
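The raw-versus-working-copy practice described above can be sketched as a small script (file names here are hypothetical placeholders; adapt them to the convention recorded in your DMP):

```python
import shutil
from pathlib import Path

# The raw file is the authentic, untouched version of the data.
raw = Path("survey2023_raw.csv")
# The working copy is the only file that cleaning or analysis should edit.
working = Path("survey2023_working_v01.csv")

# Copy the raw file once; never overwrite it afterwards.
if raw.exists() and not working.exists():
    shutil.copy2(raw, working)  # copy2 also preserves file timestamps
```

Later versions (e.g., `survey2023_clean_v02.csv`) would follow the same convention, so each file name uniquely identifies a stage of the project.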
There are additional considerations for managing quantitative social science data in projects that incorporate a longitudinal design. In a longitudinal design, which is common in the social sciences, researchers collect data from, or aim to compare data on, the same participants over multiple years. This presents challenges in matching the data for a given participant in one year to the same participant in other years and in maintaining data integrity over time and across iterations of various datasets. To complicate matters, not all participants will remain in a study over time; there will be some degree of drop-off, and thus the number of participants across years may be inconsistent.
RDM includes practices related to instituting a workflow or process to track how files are merged and the changes between versions of a dataset. RDM also relates to decisions about which version of the file will be shared or deposited in the long term. Should researchers deposit each wave (i.e., each dataset for a specific time period) as a separate dataset with instructions on how to merge the files? Or should researchers share the single merged dataset that incorporates many years? There is no right or wrong answer to these questions. RDM ensures that a decision is made one way or the other, ideally based on which version of the dataset is required to replicate published findings or according to more general disciplinary norms, and documentation is gathered and made available depending on the option chosen by the researcher(s).
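The wave-merging question above can be illustrated with a hypothetical two-wave merge in pandas; an outer join on a stable participant ID retains participants who dropped out after the first wave:

```python
import pandas as pd

# Hypothetical waves: participant 2 dropped out before wave 2.
wave1 = pd.DataFrame({"participant_id": [1, 2, 3], "score_w1": [10, 14, 9]})
wave2 = pd.DataFrame({"participant_id": [1, 3], "score_w2": [12, 11]})

# An outer merge on the stable participant ID keeps all participants;
# missing wave-2 values become NaN and should be documented in the codebook.
merged = wave1.merge(wave2, on="participant_id", how="outer")
print(merged)
```

Whether the deposited dataset is this merged file or the separate waves plus merge instructions is exactly the documentation decision discussed above.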
RDM Issues Regarding Digital Tools and Software for Quantitative Social Science Data Collection
Survey research is a commonly used and cost-effective method in both qualitative and quantitative research in social sciences. Most survey designs are non-experimental in nature. They are used to describe and estimate the prevalence of a phenomenon and/or to identify specific relationships among various factors.
Information collected using online surveys in social sciences could be sensitive in nature, containing personal information (e.g., age, gender and ethnicity, email address, IP address) and/or personal health information (e.g., self-reported former diagnosis of medical conditions). As stated in the Tri-Council policy statement on the ethical conduct for research involving humans (TCPS 2), it is every researcher’s ethical duty to protect and safeguard their research data and their participants’ information from unwanted and unlawful access. As such, determining the level of sensitivity of research data and the consequential options for active data storage, collection, and analysis, is another key aspect of RDM when working with human participants. For more information see chapter 13, “Sensitive Data.”
However, most of us are not cybersecurity experts. It is extremely difficult to check whether a vendor complies with applicable laws and regulations, whether the vendor has externally certified security controls, or whether data are encrypted in transit and at rest. Using institutionally licensed and/or vetted survey solutions for research whenever possible can save researchers many headaches related to compliance with institutional or governmental cybersecurity policies. When preparing a DMP for a quantitative social science project, you have the opportunity to describe the methods for data collection and the tools or software that may be used in that process. This is an important aspect of the planning stage and reiterates the utility of DMPs in the context of quantitative social science research.
For example, if you were to procure an external third-party online survey tool (most likely cloud-based), it is important to thoroughly investigate where the server and subcontractors’ servers are physically located. Although some of the cloud-based survey tools might be reputable and secure, their subcontractors’ practices or physical locations (e.g., server located outside of Canada) could still put your research data at risk due to non-compliance with applicable Canadian privacy laws and regulations. If the server hosting the online survey platform is located in the United States, the data stored there are subject to the U.S. Patriot Act. Moreover, some specific funding agreements might prevent research data from being stored outside of Canada. These are considerations that can be reviewed and resolved in advance by using a DMP.
Curating Quantitative Social Science Data for Reproducibility
The final phase in a typical quantitative social science research study involves decisions related to depositing (i.e., publishing) and/or archiving any data that underlie publications stemming from the study. Although disciplinary norms related to openly sharing research data vary across social science disciplines and fields, the practice is becoming increasingly common. In addition, funders such as Canada’s three federal research funding agencies (the agencies) and journals across social science disciplines are increasingly requiring that research data be made available or be deposited in a public repository. One driving force behind the push for publishing research data, including any related documentation and/or metadata, is the reproducibility crisis (Turkyilmaz-van der Velden et al., 2020).
The reproducibility crisis refers to the inability of researchers to replicate, or reproduce, the findings of published research. Replication is a key method for verifying the soundness or integrity of research findings. In most cases, the reason that a study cannot be verified through replication is that there is a problem with the original data, the data are not available, or the steps taken in the analysis phase to achieve the results were not described well enough (Baker, 2016). Quantitative social science has not been immune to the reproducibility crisis, and several high-profile retractions, due to problems with or fraud in the underlying data of a publication, have galvanized support for higher levels of transparency in the form of making data available (Figueiredo et al., 2019). For example, in 2015, a seemingly landmark study by two political scientists on political persuasion was published in Science. Over the course of the following five months, however, two graduate students who had requested the data for replication purposes discovered evidence of intentional fraud, and the publication was subsequently retracted (Konnikova, 2015). Two popular websites, Retraction Watch and PubPeer, currently crowdsource the tracking of retractions or concerns related to the data underlying published scholarly research. In this way, the scholarly community is holding itself accountable for producing research that can be replicated.
For quantitative social science researchers, there are several curated public data repositories where data can be published, in addition to ICPSR. They correspond to disciplinary norms related to research transparency and reproducibility and to funder and journal mandates requiring research data to be made Findable, Accessible, Interoperable, and Reusable (FAIR). Most Canadian institutions have an institutional data repository in the form of a sub-collection of Borealis, an installation of the open source Dataverse software supported by the Digital Research Alliance of Canada as part of a broader network of consortium-provided research data management infrastructure. Researchers affiliated with these institutions may deposit their datasets with their institutional Dataverse sub-collection. Although open to all disciplines, the Dataverse repository platform was initially developed for quantitative social science data, which means it is well suited to archiving the kind of small tabular files and related script files that quantitative social science researchers typically produce.
Depositing data in a public repository is a step toward making research data available, but it is not enough to ensure a study is reproducible or that data are FAIR. Additional curatorial steps should be taken, typically by a librarian or other information professional mediating the deposit for a repository, to convert proprietary file formats, such as SPSS or Stata files, to open formats, such as CSV. In addition, documentation is required in order to reuse a quantitative dataset or replicate any related findings. Documentation of a quantitative social science dataset may include a description of the study for potential future users, a codebook, metadata about the data collection (e.g., any weighting scheme that was used for survey data, the time periods of data collection, any software that was used to collect or analyze the data, etc.), scripts or code required to clean the data or reproduce components of a related publication, and the reuse license or terms of use for the data. Curators should ensure that quantitative social science data and any data collection tools (e.g., a survey instrument) are properly licensed. In the case of quantitative social science, the data collection tools can be as valuable as, or more valuable than, the research data outputs of a project. Researchers who use administrative data (e.g., open municipal data, Statistics Canada data, etc.) should check that any open government licenses applied allow for deposit of derivative datasets and whether there are any requirements regarding attribution for the original source of the data.
The most commonly applied metadata schema for social science data is the Data Documentation Initiative (DDI), which includes fields such as sample size, geographic coverage, unit of analysis (e.g., household, individual, etc.), and many more fields relevant to the social sciences. In general, data repositories built for hosting social science datasets will incorporate DDI fields in the data deposit interface and will subsequently produce the machine-readable (e.g., XML) metadata file as an automatic part of the upload process.
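A minimal, illustrative sketch of what machine-readable study-level metadata looks like, in the spirit of DDI (the element names below are simplified placeholders, not the full DDI Codebook schema, and the study details are hypothetical):

```python
import xml.etree.ElementTree as ET

# Build a tiny study-level metadata record and serialize it as XML,
# similar to what a repository generates automatically on upload.
study = ET.Element("studyDescription")
ET.SubElement(study, "title").text = "Undergraduate Course Material Costs Survey"
ET.SubElement(study, "sampleSize").text = "412"
ET.SubElement(study, "geographicCoverage").text = "Calgary, Alberta, Canada"
ET.SubElement(study, "unitOfAnalysis").text = "individual"

print(ET.tostring(study, encoding="unicode"))
```

In practice, depositors fill these fields in a web form and the repository emits the standards-compliant XML for them.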
Conclusion
Overall, the management of quantitative social science research data involves processes, workflows, and considerations similar to RDM practices for other discipline-specific types of data. The distinctive topics related to the lifecycle of managing quantitative social science data involve the particular types of software tools that are used to collect data (e.g., the use of cloud-based digital survey platforms) and the subsequent generation of multiple tabular files in the process of collecting, cleaning, and analyzing the data. The key practical aspects of data management related to quantitative social science typically involve: tracking versions of tabular datasets through the implementation of consistent file naming conventions; naming files and variables with machine-readable text or abbreviations; using a data collection tool that is secure and allows for customizable formatting of survey instruments; and maintaining comprehensive documentation (e.g., a codebook and metadata) to ensure data are as FAIR as possible.
Reflective Questions
- Why is it important to create a DMP for quantitative social science survey data?
- How does the choice of research design and data collection method relate to RDM aspects of a quantitative social science research project?
Key Takeaways
- Descriptive designs aim to explore a phenomenon or observation in order to describe an effect, and explanatory designs aim to explain a phenomenon or observation in order to understand an effect. A DMP can help establish file naming conventions, folder hierarchies, preparation of relevant metadata and documentation, and a plan for eventual data deposit before you start your quantitative social science research project.
- Most commonly used survey platforms in the social sciences are cloud-based software products. When using cloud-based platforms, consider implications for cybersecurity and participant privacy. During the data collection phase, think about how the spreadsheets should be versioned and named for reuse.
- The reproducibility crisis refers to the inability of researchers to replicate, or reproduce, the findings of published research. In most cases, the reason that a study cannot be verified through replication is that there is a problem with the original data, the data are not available, or the steps taken in the analysis phase to achieve the results were not described well enough. This has direct implications for making the data underlying quantitative social science publications available, typically via a public data repository.
Additional Readings and Resources
From Digital Research Alliance of Canada (the Alliance)
- Social science DMP exemplars:
From Consortium of European Social Science Data Archives (CESSDA)
From Data Curation Network
From ICPSR
- What is a Codebook
- Guide to Social Science Data Preparation and Archiving
- Sample Data Management Plan for Depositing Data with ICPSR
For examples relevant to applying RDM in social science contexts, see Emmelhainz, C. 2020. Tutorials on Ethnographic Data Management. Data in the Disciplines IMLS Grant. https://library.lclark.edu/dataworkshops/ethnography-modules
Reference List
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533, 452-454. https://doi.org/10.1038/533452a
de Vaus, D. (2001). Research design in social research. Sage Publications.
Figueiredo, D., Lins, R., Domingos, A., Janz, N., & Silva, L. (2019). Seven reasons why: A user’s guide to transparency and reproducibility. Brazilian Political Science Review, 13(2). https://doi.org/10.1590/1981-3821201900020001
Konnikova, M. (2015, May 22). “How a gay-marriage study went wrong.” The New Yorker. https://www.newyorker.com/science/maria-konnikova/how-a-gay-marriage-study-went-wrong
Turkyilmaz-van der Velden, Y., Dintzner, N., & Teperek, M. (2020). Reproducibility starts from you today. Patterns, 1(6), 1-6. https://doi.org/10.1016/j.patter.2020.100099
Glossary
- Data: sources of information or evidence that have been compiled to serve as input to research.
- Social sciences: a meta-disciplinary category encompassing scholarly disciplines that employ scientific methodologies and approaches to study social, cultural, affective, and behavioural human phenomena. Examples of social science disciplines include sociology, political science, economics, psychology, information studies, and more.
- Research Data Management (RDM): a term that describes all the activities that researchers perform to structure, organize, and maintain research data before, during, and after the research process.
- Descriptive design: a type of study design concerned with exploratory questions (e.g., what? when? how? where?), which aims at exploring a phenomenon or observation to describe an effect.
- Explanatory design: a type of study design concerned with causal relationships (i.e., causes and their effects, or questions concerning the “why” of an effect), which aims at explaining a phenomenon or observation in order to understand an effect.
- Data Management Plan (DMP): a formal description of what a researcher plans to do with their data from collection to eventual disposal or deletion.
- Interval scale: a measurement scale of numbers that are equally distanced from each other in ascending or descending order and where zero may be a point on the scale (i.e., zero does not mean the absence of a value). Examples include temperature and time. In the case of the Celsius temperature scale, zero refers to the point at which water freezes, but not the absence of temperature.
- Ratio scale: a numerical scale that may increase or decrease according to a denominator rather than equal distances. On a ratio measurement scale, zero is not a point on the scale but rather means the absence of a value. Population density is an example of a ratio measure. In the case of population density, zero refers to a place with no human inhabitants.
- Categorical data: a type of data that represent discrete categories. Ordinal categorical data are those that can be ordered or ranked sequentially. Examples include course letter grades (i.e., A, B, C, D, F) and Likert scales (5-point scales used to measure latent constructs or phenomena that cannot be observed directly). There are also nominal categorical variables, which cannot be ordered on a scale or in a sequence. These can be dummy-coded and included in a quantitative analysis. Examples of non-scalar categorical variables include gender, race, ethnicity, cities, etc.
- Dummy variable: a text or non-quantitative variable that is assigned a number for the purpose of quantitative analyses. For example, a dataset might include a variable for gender with options such as female coded as 1, male coded as 2, non-binary coded as 3, and prefer not to respond coded as 4.
- Operationalization: creating quantitatively measurable definitions of abstract concepts or constructs that cannot be measured directly.
- Administrative data: data collected as a part of the process of administering something. Administrative data are used to track people, purchases, registrations, prices, etc.
- Tabular format: a format in which information is entered into a table in rows and columns.
- Open format: not owned by a company.
- Interoperability: the requirement that data and metadata use formalized, accessible, and widely used formats. For example, when saving tabular data, it is recommended to use a .csv file over a proprietary file such as .xlsx (Excel). A .csv file can be opened and read by more programs than an .xlsx file.
- Codebook: a document that describes a dataset, including details about its contents and design.
- Delimiters: special characters reserved by computational systems or languages to denote independent objects or elements.
- Camel case: writing text with no spaces or punctuation while using capital letters to distinguish between words.
- README file: a plain text file that includes detailed information about datasets or code files. These files help users understand what is required to use and interpret the files, which means they are unique to each individual project. Cornell University has a detailed guide to writing README files that includes downloadable templates (Research Data Management Service Group, n.d.).
- Metadata: data about data; data that define and describe the characteristics of other data.
- Longitudinal design: a type of study concerned with the effect of time on an outcome; in other words, a study that measures an outcome at more than one point in time. For example, a longitudinal survey design involves repeating the same survey with the same individuals over time to understand changes in attitudes or behaviours.
- Cloud-based: a computational system distributed over more than two servers in more than two locations, allowing remote access to compute power and/or data storage via web browsers or APIs.
- The agencies: the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and the Canadian Institutes of Health Research (CIHR) are Canada’s three federal research funding agencies and the source of a large share of the government money available to fund research in Canada.
- Open source: when software is open source, users are permitted to inspect, use, modify, improve, and redistribute the underlying code. Many programmers use the MIT License when publishing their code, which requires that all subsequent iterations of the software include the MIT License as well.
- Data Documentation Initiative (DDI): a standards-based metadata schema developed for social science data.
- Data curation: the process of employing six core activities: discovering, structuring, cleaning, enriching, validating, and publishing data.
- FAIR: Findable, Accessible, Interoperable, Reusable.