3 Research Methods in the Psychological Sciences

Thomas I. Vaughan-Johnston, Queen’s University

Leandre R. Fabrigar, Queen’s University

Katie Lawrence, Queen’s University



The field of psychology is characterized by a diversity of research questions related to human thought and behaviour. As such, psychology is organized into several distinct sub-disciplines. Although psychological research spans a wide range of different content areas, there is quite a bit of similarity underlying how psychologists go about answering research questions in these different areas. This is not to say that differences do not exist in the research approaches used within different areas of inquiry. However, these differences are in large part variations in emphasis and in the specific tactics used to accomplish research objectives. The broader principles and fundamental empirical strategies guiding psychologists in different sub-disciplines are for the most part the same.

If you find that you struggle to understand some concepts in this chapter, do not worry: these are topics that experts throughout psychology continue to study. Indeed, understanding these concepts takes practice. Recognizing that readers have a varied background in this area, there is a key-word index at the end of this chapter. Further, there are many additional resources to learn more about these topics. Open access (free) supports for statistics basics include Andy Field’s (2019) discoveringstatistics.com, and Daniel Lakens (2019) has a low-cost course titled Improving Your Statistical Inferences hosted on Coursera with financial aid options (https://www.coursera.org/learn/statistical-inferences). These resources are not a substitute for a university course in research methods or statistics, but they can provide supportive background information if you want to build a stronger foundation in these key areas.

The principles and procedures that guide psychologists’ exploration of research questions are what we typically refer to as psychological research methods. The goal of this chapter is to introduce readers to the key principles that nearly all psychologists rely upon when conducting psychological research. Understanding research methods is obviously essential for any student whose ultimate goal is to embark on a career as a research psychologist in either academia or an applied setting. However, it is also important for many non-research careers; for example, many professions require employees to be “consumers” of psychological research. These individuals might not conduct research, but often might draw upon prior research to develop plans of action to help accomplish their objectives (e.g., advertising firms developing product campaigns, managers attempting to resolve conflicts between employees). Indeed, even people making decisions in their personal lives might find themselves needing to be consumers of psychological research (e.g., a parent of a child with behavioural problems considering various intervention plans). Regardless of the setting, it is impossible to be an informed consumer of psychological research without understanding the key principles that guide how research is conducted.

In discussing psychological research methods, this review is based on a series of key steps that a researcher must undertake in conducting any program of research. For ease of presentation, these steps follow a straightforward sequence. This sequence is to some degree a logical progression and, as will be seen, some steps cannot really be undertaken without first completing earlier steps. That being said, the order of some steps can be reversed or even addressed at the same time. To illustrate this design process, a recurring hypothetical example of a research program will be used: how fear and anger might influence aggression.

Key Steps in the Research Process

Formulating Research Questions. The first step to any program of research is formulating the research question. Ultimately, any study is only as useful as the research question it is designed to address. Additionally, as will be seen, many of the decisions made in later stages of the research process are informed by the nature of the question a study intends to answer.

Descriptive versus inferential research questions. When formulating a research question, a first issue to address is whether the goal of the research will be primarily descriptive versus inferential in nature. Descriptive research questions largely focus on describing one or more psychological or behavioural constructs in a given domain of interest. For example, a researcher studying aggression might be interested in the prevalence of verbal aggression in the workplace. This researcher might wish to determine the proportion of employees in Canadian workplaces who have been verbally demeaned or insulted by their co-workers.

Although psychological research is sometimes primarily descriptive in nature, most psychological research is predominantly inferential in its goals. Inferential research involves the exploration of relations among psychological and behavioural constructs. For example, in the context of aggression, a researcher might want to know what characteristics of workplace employees are associated with them being perpetrators of verbal aggression. Clearly, both types of research question (descriptive and inferential) are useful and interesting. However, if we ultimately want to understand why something occurs and/or how we can influence it, research must move beyond the purely descriptive level and begin to address inferential questions.

Exploratory versus confirmatory research questions. Assuming an inferential research question, the next consideration is whether this question will be approached in an exploratory or confirmatory manner. In exploratory research, researchers do not have specific expectations, but rather more general notions regarding the answer to the question. For example, a researcher interested in what characteristics are associated with the likelihood of being a perpetrator of verbal aggression in the workplace might measure a wide range of different characteristics of employees (e.g., their proclivity to experience different emotions, their level of seniority in the organization, various personality traits) and then conduct analyses to see which characteristics are associated with aggression. In contrast, for confirmatory research, the researcher specifies what factors are likely to cause aggression and perhaps even when and why such factors have their effects. These hypotheses are generally derived from past research and/or some theory regarding the phenomenon of interest. The researcher then focuses attention primarily on those factors that have been hypothesized to produce the outcome of interest.

Both approaches have their advantages and limitations. The strength of exploratory research is that it encourages researchers to think broadly about the phenomenon of interest and maximizes the opportunity of stumbling on unexpected discoveries. However, although exploratory studies often consider a wide range of possibilities, they are rarely optimal tests of any single explanation. In contrast, confirmatory studies tend to have a narrow focus, but usually provide more systematic and complete tests of the factors they are designed to explore. For instance, if a study must cover a wide range of different characteristics of employees that could predict their proclivity to engage in verbal aggression, it might not be feasible to extensively measure each factor (e.g., the researcher might only be able to include a few questions measuring each factor). In contrast, if a researcher has explicitly postulated that tendency to experience the emotions of fear and anger are major determinants of aggression, the researcher might be able to include very extensive measures of each emotion, and perhaps even multiple different types of measures of each emotion. The two approaches, however, are not mutually exclusive. Indeed, often a program of research will adopt an exploratory approach in its early phases and then gradually transition to a more confirmatory approach.

Basic versus applied research questions. A final consideration during the research question formulation stage is whether the study will be designed to primarily address a basic (i.e., theoretical) research question versus an applied research question. Basic research is aimed at formulating and testing fundamental psychological principles governing a domain of interest. For instance, a researcher might be interested in developing a theory of the role of emotions in aggression. The goal of this researcher is to develop principles that explain which specific emotions either increase or decrease aggression and why these emotions have the effects they do on aggression. Thus, the goal is to arrive at a fundamental understanding of the relations among the constructs of emotions and the construct of aggression.

In contrast, applied research questions tend to focus on a specific problem. They typically emphasize predicting or influencing an outcome rather than understanding why that outcome is predicted or influenced by a given factor. Indeed, applied research questions often focus on the effects of a specific measure or intervention with little concern as to why that measure or intervention accomplishes its goal and/or the effects of the broader construct of interest that measure or intervention is presumed to represent. For example, an applied researcher might be interested in testing if a specific measure of anger predicts employee aggression or if a specific anger-management program lowers employee aggression.

As with other distinctions, the basic versus applied research question distinctions are not mutually exclusive. Often basic research might have the ultimate goal of developing principles that can be used to solve applied problems. Likewise, the exploration of applied questions can often contribute to the understanding of basic questions. Thus, this distinction is more a matter of emphasis than a fundamental difference in the nature of the research question being addressed.  However, this difference in emphasis does have implications for the methodological decisions that a researcher might make at subsequent stages of the research process.

Selecting dependent variables. Once a researcher has formulated a research question and presuming that question is inferential in nature, the researcher’s next step is to determine the specific constructs of interest. More precisely, constructs are those elements in a study thought to vary across people and/or situations. Although the goal of all inferential research is to determine the relationship between constructs, some of this research involves merely finding associations between constructs, whereas other studies test hypothesized causal relationships among the construct(s) of interest. A researcher cannot assess “fear”, “aggression”, or other constructs directly, but instead selects specific measures that represent constructs in an observable way. Measures representing the outcome constructs in hypothesized relationships are called dependent variables because they are conceptualized to be dependent on the levels of one or more independent variables, a topic that will be addressed later in the chapter.

After having determined the constructs that one intends to study, one must more precisely define them. Some constructs are more easily defined than others. For example, when measuring psychological constructs such as personality, there are numerous conceptualizations of personality, including the Big Five and HEXACO frameworks. In contrast, physical traits such as height and weight often have widely accepted definitions that are consistently applied across domains of research. Keep in mind that how a researcher chooses to define the study variables will affect the results of the study, the comparability of outcomes to other studies that have researched the same constructs, and one’s ability to operationalize the constructs in a way that will allow for feasible, sensible, and meaningful measurement.

For example, there is a broad range of ways to characterize aggression (e.g., Archer & Coyne, 2005). For some research questions, a broad conceptualization that includes indirect, relational, and social aggression may be very useful. In other cases, a very specific definition of aggression as “causing physical harm to others” may be preferable. Even within this seemingly narrowed conceptualization of physical harm, important conceptual questions require answering: for example, should the mere desire or wish to cause physical harm count, or only aggressive actions that are actually expressed by a participant?

Operationalization is the formal term for specifying how a construct will be represented by a specific measure. For example, if one wishes to measure an individual’s aggression, the experimenter must decide which method of instrumentation should be used to obtain an accurate measurement (e.g., a self-report scale, observational techniques). Thus, one possible operationalization of individual aggression could be self-report using the Aggression Scale (e.g., Orpinas & Frankowski, 2001). Researchers usually hope that they can make inferences from the measure back to the construct that the measure is trying to capture. When operationalizing dependent variables, one must aim to select measures that are sensitive enough that the influence of the independent variable on the dependent variable can be detected. Measures should strive to accurately capture a construct of interest, a topic that will be discussed in detail later as construct validity.

Level of measurement. There are four major categories of measurement level. Nominal scales involve any measure for which scores are given as categorical labels. For example, in our fear/anger and aggression study, we might assess participants’ cultural background (e.g., German, Chinese) as a nominal variable. Notice that nominal scales like this do not imply any rank ordering of the categories. That is, cultures like Germany or China are not options that vary along a single continuum of provided options, but are categories that are selected.

Conversely, ordinal scales provide a rank ordering of the categories. For example, a measure might ask people to rank-order several aggressive thoughts they are experiencing from most to least aggressive. Here the response options are ordered from most aggressive to least aggressive: a single continuum. However, also recognize there is no standard distance between the rankings: that is, the psychological distance implied by the gap between the first and second most aggressive thoughts might not be identical to the distance between the fourth and fifth most aggressive thoughts.

Interval data provides response options that are equally spaced. In psychology it is often difficult to create truly interval scaling. Imagine a self-reported anger scale ranging from 1 (slight anger) to 2 (moderate anger) to 3 (strong anger). The psychological distance between response options such as slight to moderate, versus moderate to strong, although intended to be equal, might not necessarily be equivalent to one another, making it difficult to form truly interval measurements. However, when multiple items are aggregated together, pseudo-interval scaling often functions quite similarly to true interval scaling, and such aggregated ordinal data can often be treated statistically as though it were interval (Harpe, 2015).

Ratio data additionally have a true zero point. For example, if participants’ punching a doll is used as a behavioural measurement of aggression, zero punches indicate a complete absence of this behaviour. This matters when comparing levels on the scale multiplicatively, that is, when claiming that one score reflects some multiple of another. A 2 on a self-report scale of anger does not indicate “twice” as much anger as a 1, but a person who punches a doll twice has actually engaged in twice as much of this type of aggression compared to someone who punches once.

Methods of measurement. There are several methods of measurement routinely used in psychology. The most common of these is self-report measurement. These measures ask participants to verbally report their standing on the psychological or behavioural construct of interest, typically using some form of structured rating scale. Self-report tools are usually considered to be direct measures because participants are directly asked to assess their own psychological attributes. Examples include the Beck Depression Inventory (Beck, Steer, & Brown, 1996) or the NEO Five-Factor Inventory (Costa & McCrae, 1991). One issue that commonly arises when using self-report measures is that they are susceptible to socially desirable responding (Paulhus, 1991), meaning that respondents may distort their responses in order to present themselves favourably. For example, people may wish to understate how much anger or fear they are feeling, if feeling these emotions strongly is considered inappropriate. Another issue is that people may not always be able to provide accurate self-report responses. For example, self-report responses are influenced by the cognitive accessibility of relevant information (e.g., Strack, Martin, & Schwarz, 1988), making these responses susceptible to influence based on how questions are framed. Additionally, people may simply not have perfect introspective self-awareness (Nisbett & Wilson, 1977), and therefore may not be capable of accurately describing all of the reasons why they think or feel certain ways.

Another common method of data collection is the use of indirect measures, which refer to tools that assess participants without directly asking them to provide self-assessment of their psychological attributes (De Houwer, 2006; Gawronski & De Houwer, 2014). A very common form of indirect measure is implicit measurement, referring to measures that assess relatively uncontrolled and automatic types of participants’ responses. Examples of implicit measures include the Name-Letter Task (NLT; LeBel & Gawronski, 2009; Nuttin, 1985), the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998), and the Affect Misattribution Procedure (AMP; Payne, Cheng, Govorun, & Stewart, 2005).  Although these implicit measures are quite diverse in form, they generally work by assessing reaction time, or subtle response patterns that would be difficult to deliberately control. For example, implicit measures often assess how quickly people pair objects together, following the logic that similar objects or ideas are “congruent” for respondents, and are easily categorized together. For example, people who pair “good” with “white” quickly, but “good” with “black” slowly may be viewed as having a preference for white over black people. Other implicit measures suggest that underlying feelings about an object can be assessed by how respondents’ feelings spill over onto stimuli presented shortly after. The AMP, for example, exposes participants very briefly to an image of an attitude object (a prime), and then asks them to rate their opinion towards a relatively neutral stimulus (e.g., rating how much they like a meaningless shape). Individuals who rate the neutral stimulus as “bad” after viewing a particular prime are viewed as having a negative opinion of the prime object (Payne et al., 2005).

One reason that indirect measures are often championed is that they are thought to be highly resistant to social desirability concerns (Petty, Fazio, & Briñol, 2012). For example, when measuring racial attitudes with a self-report scale, psychologists may be concerned that respondents would have a powerful motivation not to admit racist attitudes. An indirect measure can circumvent these social desirability concerns by measuring extremely subtle reaction time differences that would be difficult to control. It may be noted that some research has identified specific conditions whereby respondents can occasionally control ‘implicit’ responses (Klauer & Teige-Mocigemba, 2007), but generally respondents will find it much more difficult to deliberately control their responses on these tasks. Thus, implicit measures may not be completely immune to social desirability or other motivated control attempts, but they are highly resistant to such response biases.

One common observation about implicit measures is that they do not always show high levels of convergence with their explicit counterparts. Although critics have sometimes framed this low convergence as a problem, low correlations may simply suggest that implicit measures capture unique variance in constructs that traditional self-report measures fail to capture. Importantly, this implies that direct and indirect measures may have incremental validity in predicting behaviors, meaning that using both types of measure to predict behavior is more powerful than using only one type of measure. Reviews have shown that incremental validity of implicit and explicit attitudes can indeed be observed (Friese, Hofmann, & Schmitt, 2008). Furthermore, each type of measure may be uniquely helpful in specific contexts. In conditions where people are deliberate and thoughtful, explicit measures appear to have better predictive power, whereas implicit measures are better used to predict spontaneous behavior (Asendorpf, Banse, & Mucke, 2002).

Oftentimes in psychology, psychological processes are inferred based on physical changes that occur in participants’ brains or other bodily regions. Physiological measures record processes such as voltage fluctuations in brain neurons (i.e., brain activity) captured using electroencephalography (EEG), metabolic processes using positron emission tomography (PET), and blood flow in the brain using functional magnetic resonance imaging (fMRI). For example, some researchers have assessed people’s fear responses by assessing activation of their amygdala region through techniques including magnetoencephalography (Moses et al., 2007). Cacioppo and Tassinary (1990) have chronicled some of the impressive advances in neuropsychology’s ability to noninvasively examine brain activity. Like implicit measures, physiological measures are often seen as preferable to self-report measurement because they can obviate participants’ attempts to control their responses. Although these measures therefore have great value in addressing certain concerns, one general limitation of these methods is that because of the complicated technology required, their administration requires highly specialized technicians, and they are therefore costly and time-consuming to use. More substantively, numerous neuropsychologists have warned readers about the dangers of over-assuming causal relationships between brain “signals” and participants’ emotions, thoughts, or actions (Cacioppo et al., 2003).

Just as implicit and physiological measures operate by capturing respondents’ relatively uncontrollable reactions, observational measures allow social scientists to obtain information from their subjects through evaluating participants’ overt behaviours. Observations can be made with or without participants being aware that such observations are occurring. For example, aggression has been assessed by measuring how much hot sauce participants put into a glass of water supposedly intended for the next participant to enter the laboratory, with large amounts of hot sauce indicating an aggressive behaviour (Liebermann, Solomon, Greenberg, & McGregor, 1999).

Reliability and validity. A comprehensive explanation of the development of new measures goes beyond the scope of this chapter, but guidelines are available for interested readers (John & Benet-Martinez, 2014; Simms, 2008). The following section instead focuses primarily on issues of measurement reliability and validity, two fundamental psychometric properties.

Although both reliability and validity in measurement are crucial, reliability is required for a measure to be valid, but validity is not required for a measure to be reliable. In principle, reliability simply refers to the consistency with which a measure provides the same information, although it comes in many forms. For example, psychologists may measure the same construct in the same people across a span of time, using the same measure. If a measure provides consistent measurements across time, and the construct it assesses remains stable, people who score low or high at one time point should continue to do so later; this is called ‘test-retest reliability’. Of course, measures of constructs that are expected to change across time (e.g., acute experiences of fear) do not typically show high test-retest reliability, because participants’ responses change due to the fleeting nature of emotion. However, many traits are thought to be relatively stable across the lifespan, such as personality (Costa & McCrae, 1993), and high test-retest reliabilities serve to indicate that these constructs’ measures are providing consistent information.

Another tool for assessing reliability is the extent to which independent evaluators judge something in a similar manner: ‘inter-rater reliability’. For example, if observers were asked to evaluate aggressive behaviour displayed by participants, inter-rater reliability would be high if all the judges observed and recorded a similar number of aggressive behaviours. If judges’ evaluations completely differed from one to the next, this would be evidence that their observations lack reliability, that is, lack consistency. Similarly, when evaluating various items that are thought to assess the same underlying construct, ‘internal consistency’ refers to when items correlate highly with one another due to respondents answering in a consistent way across items (Henson, 2001). For example, a highly fearful individual should express that they are “terrified”, “frightened”, as well as “scared”. The core principle is consistency: consistent responses to these items within the same respondents would indicate that the items are seen as reflecting the same construct, meaning that they have reliability.
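The forms of reliability described above are usually expressed as numbers. The following is a minimal sketch in Python, assuming invented scores for five hypothetical respondents; the variable names and data are illustrative only and do not come from any real study. It computes a test-retest correlation (Pearson's r between two time points) and Cronbach's alpha, one widely used index of internal consistency, for three fear items.

```python
from statistics import mean, variance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item."""
    k = len(items)
    # Each respondent's total score across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical (invented) trait scores for five people at two time points.
time_1 = [10, 14, 18, 22, 30]
time_2 = [11, 13, 19, 21, 29]
retest_r = pearson_r(time_1, time_2)

# Hypothetical (invented) 1-5 ratings on three fear items.
terrified = [1, 2, 3, 4, 5]
frightened = [2, 2, 3, 5, 5]
scared = [1, 3, 3, 4, 5]
alpha = cronbach_alpha([terrified, frightened, scared])

print(f"test-retest r = {retest_r:.2f}, Cronbach's alpha = {alpha:.2f}")
```

In this invented data set both values are high because respondents keep nearly the same rank order across time points and across items. Real data are rarely this tidy.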

After having operationalized your dependent measures it is important that you ensure that your measure displays validity. A measure is valid insofar as it quantifies accurately what it purports to measure. Construct validity refers to the degree to which a measure specifically and sensitively captures its intended construct (Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). Although methodology texts often introduce dozens of unique types of validity as though each were completely separate, many of these are best viewed as types of evidence that allow researchers to determine if a measure has construct validity. For example, ‘criterion validity’ is the extent to which a measure is associated with other measures that should logically be related to its construct. This is really evidence of a measure’s construct validity: if a measure effectively captures its construct, it should be related to things that its construct relates to. For example, when developing a self-reported fear measure, this fear measure should be related to avoidance behaviors, because people are motivated to avoid things that frighten them. If they do correlate, this is consistent with the notion that the fear measure is accurately or validly measuring fear. Similarly, methodologists refer to ‘discriminant validity’ when a measure shows minimal associations with irrelevant variables. For example, a fear measure should not be closely associated with social desirability measures. Indeed, if a fear measure was negatively related to a social desirability measure, it might indicate that people are denying any fear that they feel due to social desirability concerns such as not wanting to sound afraid. This would threaten a fear measure’s construct validity, because the fear measure would no longer only be measuring fear.

If a measure appears to reflect its construct according to either experts or laypeople, then it is said to possess ‘face validity’: once again, this is evidence of construct validity. If emotion experts think that the items on a fear measure are not reflective of fear, this could raise concerns about the measure’s construct validity. Interestingly, sometimes it is disadvantageous for a measure to possess face validity. For example, if participants are aware that a scale seeks to measure aggression, then it is likely that participants may disagree with items to appear non-aggressive to the extent that aggression is socially inappropriate or anti-normative. To obtain accurate results it is therefore occasionally advantageous to reduce face validity depending on the construct of interest, in other words increasing a scale’s subtlety (Holden & Jackson, 1979).

Selecting independent variables. Once the dependent variable (DV) has been determined, a researcher selects one or more independent variables (IVs), which represent variables conceptualized as predicting or influencing DVs. Many of the same criteria used to evaluate DVs are also relevant when considering IVs. For example, the reliability and validity of IVs are as important as they are for DVs, and are often assessed in the same ways. Continuing with the example of fear or anger inducing aggression, fear/anger would be IVs: the variables understood to be increasing or decreasing aggression. However, IVs are not precisely like DVs. For one thing, DVs are always measured whereas IVs may be measured or manipulated. Both measurement and manipulation have some advantages and disadvantages, and each opens up several specific questions for the researcher.

Manipulations. Manipulations are changes in constructs induced by deliberately stimulating or inhibiting those constructs through some process of the study. In the recurring example, a manipulation would be any action designed to actively change participants’ current levels of anger or fear. As with DVs, consider the many ways that fear/anger could be operationalized. One could remind participants of a time when they felt fear/anger in their own lives (recalled emotion; e.g., Baker & Guttfreund, 1993) or have them read fictitious narratives intended to make them experience fear/anger (emotion stimulated by narrative engagement). One could employ deception to generate anger: for example, Nisbett and Cohen (1996) had a confederate “accidentally” bump into participants as they walked in a corridor, which elicited anger in participants. Despite being very different, these are all manipulations designed to stimulate an IV.

One reason to incorporate a manipulation rather than a measure of one’s IV is that manipulations have advantages with respect to internal validity, which reflects researchers’ ability to make causal claims about the relationship between study variables. Imagine measuring fear (our ‘IV’) and then measuring aggression (our ‘DV’) just a few moments afterwards. Assuming an association existed between these measures, what could a researcher conclude? It is not clear that fear caused aggression. One other possibility would be that participants were already feeling aggressive before fear was measured. Those aggressive intentions caused the participants to feel fear, and were still present when the aggression measure was collected. Thus, in this case, aggression might just as easily have caused fear (this risk is sometimes called reverse causation). Perhaps more likely, a third construct could be responsible for causing the other two constructs to appear associated. For example, participants may have been experiencing physiological arousal at an earlier point in the procedure. This arousal caused them to endorse the fear items because their heart was racing and their palms were sweating, so they inferred that they were feeling fear. Furthermore, their arousal led them to behave more aggressively. Note that in this case, arousal was actually responsible for both variables seeming to ‘increase together’ (covary), and no real causal relationship existed between fear and aggression. This threat to internal validity is sometimes called the third variable problem.
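The third variable problem can be made concrete with a small simulation. The following is a minimal sketch in Python, assuming a deliberately artificial world in which arousal influences both fear and aggression while fear has no effect on aggression at all. The sample size, effect sizes, and seed are arbitrary choices made for illustration.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def residuals(y, x):
    """What is left of y after removing its linear relation with x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 5000

# Assumed causal structure: arousal -> fear, arousal -> aggression.
# Fear does NOT appear in the equation for aggression.
arousal = [rng.gauss(0, 1) for _ in range(n)]
fear = [a + rng.gauss(0, 1) for a in arousal]
aggression = [a + rng.gauss(0, 1) for a in arousal]

# Fear and aggression covary even though neither causes the other.
raw_r = pearson_r(fear, aggression)

# Statistically removing arousal from both makes the association vanish.
partial_r = pearson_r(residuals(fear, arousal),
                      residuals(aggression, arousal))

print(f"raw r = {raw_r:.2f}, r controlling for arousal = {partial_r:.2f}")
```

Under these assumptions the raw correlation is substantial (about .5 in expectation), whereas the correlation after controlling for arousal is close to zero. In real research the third variable is often unmeasured or unknown, so this kind of statistical control is not always available.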

These are questions of, and perhaps serious threats to, internal validity. Now imagine randomly assigning half of a group of participants to watch a frightening movie scene that results in increased fear, and the other half to watch a non-frightening scene that doesn’t increase fear (thus, fear is manipulated). That is, every participant has an equal likelihood of being in any of the experimental conditions. Because people are randomly sorted into these groups, it is unlikely that a third variable caused differences in fear between the two groups. This is because any idiosyncratic individual differences between participants would be distributed randomly across conditions. Instead, differences between the groups are most likely attributable to the manipulation’s effects, helping to establish a causal relationship wherein the IV causes the DV. Researchers’ ability to make such causal claims is referred to as internal validity.
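The random assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name randomly_assign, the condition labels, and the participant numbers are hypothetical and do not come from any particular study.

```python
import random

def randomly_assign(participant_ids, conditions, seed=None):
    """Shuffle participants, then deal them out to the conditions in turn.

    Every participant has the same chance of landing in each condition,
    and group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, pid in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(pid)
    return groups

# Hypothetical example: 40 participants, two film-clip conditions.
groups = randomly_assign(range(1, 41), ["frightening_clip", "neutral_clip"], seed=1)
print({condition: len(members) for condition, members in groups.items()})
```

Because the shuffle ignores everything about the participants, pre-existing differences such as arousal or trait aggression should, on average, be spread evenly across the two groups.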

A common choice when using manipulations is to incorporate a control group, representing the condition in which participants would be if they were not subjected to the part of a manipulation that is of interest to you. For example, consider all the elements of watching a five-minute frightening film clip: five minutes of audio and visual stimuli, the feeling of wearing headphones, sitting in a chair, and (hopefully) feeling fear. A control group controls for as many of these irrelevant aspects as possible, leaving only the fear variable to differ across groups. Thus, a control group might watch a five-minute film clip (wearing headphones; sitting down) of an emotionally ‘neutral’ scene such as a mechanic fixing a dishwasher. Differences in group behaviors are now hopefully attributable only to fear, rather than sitting, wearing headphones, or film-watching in general, since even a boring dishwasher scene contains all of those elements.

This clustering of participants such that some experience one condition, others experience a different condition, and others experience a control condition is characteristic of a between-participant design, which helps to examine causal relationships by randomly assigning people to one of two or more conditions and examining differences emerging between the groups. Alternatively, in a within-participant design, participants would each undergo each condition. Re-using the video-watching example, a within-participant design might have all participants watch both clips, measuring aggression after each clip. In this case, no random assignment to conditions is required because the same individuals participate in both conditions. However, a researcher will often rotate the order of presentation: half of participants watch the control film before the frightening film, and half watch in the reverse order (this process is sometimes called counterbalancing the order of conditions). Otherwise, the order of film presentation might explain any differences between conditions.
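Counterbalancing in a within-participant design can be sketched in a similar way. The example below is a minimal illustration; the function name counterbalance and the clip labels are hypothetical.

```python
import random

def counterbalance(participant_ids, orders, seed=None):
    """Give each participant one presentation order, using each order equally often."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: orders[index % len(orders)] for index, pid in enumerate(shuffled)}

# Hypothetical example: every participant sees both clips, in one of two orders.
orders = [
    ("neutral_clip", "frightening_clip"),
    ("frightening_clip", "neutral_clip"),
]
schedule = counterbalance(range(1, 21), orders, seed=2)
for pid in sorted(schedule)[:3]:
    print(pid, schedule[pid])
```

Each participant still completes both conditions; only the sequence varies, so an effect of order (e.g., fatigue or carry-over from the first clip) is not confused with an effect of the clip itself.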

Issues of manipulations and measurements. It is often advisable to consider a similar checklist of priorities when using measures or manipulations. First, consider issues of confounding variables. One common objection to measuring IVs is that measures are almost always influenced by constructs other than the one intended. For example, it may be difficult to measure fear without a measurement being impacted by participants’ neuroticism (a personality trait in which people experience chronic, negative emotionality). Therefore, manipulations may seem superior because they do not introduce such confounds. However, manipulations may also introduce irrelevant confounds if the manipulation influences constructs other than the one(s) intended (see Fiedler, Kutzner, & Krueger, 2012). For example, a manipulation designed to increase fear might also make some participants sad, angry, or surprised, making it harder to deduce what was ultimately responsible for any aggression effects. Thus, whether a researcher measures or manipulates an IV, they should still consider how irrelevant variables may interfere with their study’s validity.

Second, issues of transparency, the degree to which participants can understand the true purpose of a study, are relevant to both measured and manipulated IVs. For example, it is usually important that participants do not know the precise hypothesis of a study, lest they simply act as they believe they are supposed to (i.e., demand characteristics; Orne, 1962). Suppose a study consists only of measuring fear and anger, before measuring aggression. Participants may deduce that the researcher wants to know whether fear and/or anger predict aggression, and act accordingly (acting either to confirm or disconfirm that hypothesis). One way to avoid this problem is to use one of many measures that are designed to measure a construct subtly, to avoid being obvious about what the experimenter is interested in, as discussed above. Another easy solution is to include filler measures: scales that researchers do not wish to evaluate, that are included to confuse participants’ understanding of the study’s purpose. Participants will typically assume that all study measures are relevant to the experimenter’s research questions, and therefore these bogus measures will throw off their guessing the true hypothesis.

In some contexts, manipulations may also make the study’s purposes transparent. If participants understand what a manipulation is meant to do to them, they may act differently due to their awareness of the experimenter’s research goals. Transparency is a particular issue for within-participant designs, because these often imply to participants that the experimenter wants to know how something varies across conditions, each of which each participant has experienced. In between-participant designs, in contrast, the design is often well-hidden simply because participants are not aware of what other participants are experiencing and thus do not know what their responses/actions are being compared against. One precaution that is often sensible is to include a funnel interview (Page & Scheidt, 1971). In a funnel interview, participants are asked increasingly probing questions about their experiences in the study and what they thought the study’s purpose was. Participants who truly understood the study’s purpose will presumably state this when they are asked, and researchers can consider whether to refine the manipulation, cut the data of the suspicious individuals, or else simply run statistical tests with and without suspicious participants included to assess the impact of suspicion.

The concept of construct validity was previously introduced with reference to measurements, but it has applicability to manipulations as well. Consider the previous example of bumping into participants to produce anger. In reality, it was primarily participants who were raised in the Southern, not Northern U.S. states who felt anger at the staged hallway collision (Nisbett & Cohen, 1996); Northerners quite often felt amused by the experience. This raises a critical question: for whom is a manipulation likely to activate its intended construct? The same stimulus that would frighten a child may not produce fear in adults. The easiest way to determine if a manipulation has construct validity is a manipulation check performed either during the study, or on a separate pilot sample. A manipulation check usually asks a participant a question directly related to the construct: for example, after watching a (hopefully) scary film clip, participants may be asked “how scary was that film?” or “how scared are you?” If the fear clip is felt to be scarier than the control clip, elevated fear ratings should be produced.

Context. We next consider elements of research context that a researcher must consider when planning a study. In social science, context generally describes the population of interest (people) and the location and time (setting) in which research takes place. Context is of great importance to psychologists for at least two reasons. First, context helps to define how measures and manipulations should be designed to optimally capture a construct (i.e., construct validity). Just as some measures are only effective for children (e.g., “I want my mommy” as an item measuring fear), some stimuli have different psychological meanings in certain eras. For example, consider how the meaning of the name “John F. Kennedy” changed from 1962 to 1964 (with his assassination occurring in 1963), or how the words “John F. Kennedy” might have radically different meanings to a respondent who was alive in the 1960s compared to a respondent who was born in the 21st century. This is very important in psychology, because it means that measures and manipulations that were developed originally for one context may or may not work effectively in other contexts. Ultimately, psychologists are interested in the relationships between constructs, not measures. Therefore, materials must be found to possess construct validity within a given context and within a given population before they can reasonably test how constructs interrelate. There is often a trade-off to consider. Materials that are very customized for a specific population may be extremely powerful tools for studying that population, but may require a serious re-evaluation and development process when alternative groups are studied, making generalization attempts more laborious.

A second reason why context and population matter is that psychologists sometimes wish to test the external validity or generalizability of findings. Suppose that psychologists discover that fear does causally produce aggressive responses among children. Of course, it does not automatically follow that the same relationship would occur among adults, whose emotional self-regulation abilities may be considerably different. Assuming that a construct-valid fear manipulation was employed among adults, and assuming that a construct-valid aggression measure was also used, the fear/aggression association could be examined among adults as well. Whether the association emerges or not would then test the external validity of the fear/aggression link, that is, how generalizable the link between variables is.

Participants. In psychology, the population of interest is typically a very large group of people about whom the researcher wishes to draw conclusions. Researchers create inclusion criteria and exclusion criteria to aid in the process of defining the population of interest. The former refers to characteristics that would render a person eligible to participate, and the latter refers to characteristics that would disqualify a person from taking part in the planned study. For example, if a social scientist were interested in the aggression levels of criminally convicted juvenile offenders in Canada, then the inclusion criteria might include age (<18 years). Having no criminal record would be an exclusion criterion.

Measuring every individual in the population of interest is virtually never feasible (Banerjee & Chaudhury, 2010), requiring psychology researchers to test their hypotheses using a subset of the population of interest known as a sample. In some cases, researchers aim to obtain a truly random sample, which ensures that every member of the population under investigation has an equal probability of being included in the sample. One situation in which random sampling is important is when descriptive analyses are important to a researcher. For example, if researchers want to know accurately what the average aggression level is among Canadian juvenile offenders, non-random sampling will likely undermine the accuracy of their descriptive estimates.

Truly random samples are often impossible to obtain (Sweetland, 1972) resulting in the collection of data by means of a convenience sample, meaning that a sample is obtained from a more readily available subgroup of the population. University students are a classic example of a convenience sample when the population of interest is “all people”, because students are often easily accessible to researchers, for example participating in research in exchange for bonus marks in their courses or small cash payments. Naturally, university students differ from random members of the public in some respects: they are likely to have elevated intelligence, an increased desire for thinking, and so on. However, a worthwhile consideration is whether a convenience sample differs from the population on specific constructs of interest to a researcher. For example, a perceptual psychologist studying visual perception may consider university students to be quite representative of people with respect to rods and cones in their retinas. To this researcher, the attributes for which university students might be expected to differ from the general population probably would not interfere with testing their key hypotheses.

Other cases may be more ambiguous, and the utility of convenience samples may also depend on the type of research question being pursued. For example, if university students have unusually developed cognitive abilities, this is likely to bias descriptive research questions about cognitive abilities. Inferential research, however, necessitates closer scrutiny regarding the use of convenience samples. For example, it is unclear whether a convenience sample of university students may have a different relationship between fear/anger and aggression, compared to children or older adults. That is, the relationship between emotion and aggression (an inferential question) may itself differ across a span of age levels. One possibility, if a researcher is concerned about such age effects, would be to collect a representative sample. However, it is not clear that this solution is without issues. For example, suppose that fear relates to increased aggression in young adults, but that children instead become less aggressive when they are afraid. If a researcher were to engage in equal sampling of children and young adults, the study might show no effect of fear when in fact there are two very different effects that are masked because the two patterns run in opposing directions. Indeed, if researchers have reasonable grounds to suspect that such differences occur across sample types, they may want to conduct multiple studies, each collecting a sample from a different population. In this hypothetical case, for instance, Study 1 would identify the positive fear/aggression association among young adults, and Study 2 would identify the negative association in children. An alternative approach would involve deliberately collecting both groups within a single large study (e.g., half young adults, half children), and then statistically analysing any differences across the groups.
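The masking problem described above can be made concrete with a small worked example. The numbers below are invented purely for illustration (they are not data from any study), and pearson_r is a hypothetical helper written out so that the sketch needs no external libraries.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented scores: fear rises from 1 to 5 in both groups.
fear = [1, 2, 3, 4, 5]
aggression_young_adults = [1, 2, 3, 4, 5]  # aggression rises with fear
aggression_children = [5, 4, 3, 2, 1]      # aggression falls with fear

r_young_adults = pearson_r(fear, aggression_young_adults)
r_children = pearson_r(fear, aggression_children)
r_pooled = pearson_r(fear + fear, aggression_young_adults + aggression_children)

print(r_young_adults, r_children, r_pooled)
```

In this deliberately extreme case each group shows a perfect association, yet pooling equal numbers from both groups yields a correlation of zero, which is why analysing the groups separately (or testing group differences statistically) can be more informative than a single pooled estimate.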

Another consideration regarding population is sample size, that is, the number of participants who will participate in a study. There exist numerous techniques to determine an appropriate sample size, usually termed power analyses, but the mathematical basis for these calculations is too complex to be fully advanced here. In general, larger samples decrease the chance that a finding will represent a statistical “fluke”. This is because as our sample becomes bigger, it better approximates the population that we want to make conclusions about. For example, if 10,000 Canadian women were surveyed about workplace aggression, the conclusions drawn about their experiences would be more likely to reflect the population of all Canadian women than conclusions drawn from a sample of only 10 Canadian women.
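A brief simulation can illustrate why larger samples approximate the population more closely. The sketch below assumes a hypothetical population in which aggression scores are normally distributed with a mean of 4.0 and a standard deviation of 1.5; these values, the function name, and the sample sizes are all illustrative assumptions.

```python
import random
import statistics

def spread_of_sample_means(sample_size, n_samples=200, seed=0):
    """Draw many samples of a given size and return how much their means vary."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(4.0, 1.5) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)
spread_large = spread_of_sample_means(1000)
print(f"n = 10:   sample means vary with SD of about {spread_small:.3f}")
print(f"n = 1000: sample means vary with SD of about {spread_large:.3f}")
```

The means of small samples scatter widely around the population value, whereas the means of large samples cluster tightly around it, so any single large sample is less likely to be a fluke.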

Although some psychologists advocate for always maximizing sample size, there are a few issues to consider when deciding on an appropriate sample size. Certainly, it is true that a larger sample size increases statistical power, or the ability to detect inferential patterns between variables where they truly exist. Similarly, descriptive statistics become more precise with larger samples. However, there are other considerations to take into account when planning research. For example, researchers may become constrained in terms of the methodologies that can facilitate such enormous samples. Researchers can collect thousands or even millions of participants through crowd-sourcing techniques or mass online testing (e.g., www.yourmorals.org; Iyer, 2019), but as we detail in a later section, online research has both advantages and disadvantages associated with it.

A final issue prompting close attention to population is how stimuli and measures will be developed for various populations. As previously discussed, scientific research proceeds by using measures and manipulations to operationalize abstract constructs. Thus, it is imperative that measures/manipulations have their intended meanings within each specific population. Consider using the same religious questionnaire for a study in both San Antonio and Salt Lake City: religious items may not have the same meaning for both populations. Some methodologists advocate for measurement invariance analysis (Millsap & Meredith, 2007; Widaman & Grimm, 2014), which uses a mathematical procedure to establish whether items of a measure perform similarly across groups at a psychometric level. Without establishing, at minimum, the basic levels of measurement invariance, comparisons across groups become suspect. Using the above example again, it becomes problematic to replicate a study on Texans with a sample of Utahans if a central measure has a completely different psychometric structure for these two groups.

Setting. A major factor in setting is whether a study takes place in a laboratory, in an online survey, or in a field context. The advantages and disadvantages of these contexts have stimulated productive research and debate. For example, laboratory research has sometimes been criticized as lacking mundane realism, or being artificial and lacking applicability to “real-world” situations (Ilgen & Favaro, 1985). However, psychologists rarely attempt to produce contexts that resemble “the real world” literally, instead focusing on participants’ experiences of a study as psychologically meaningful (Berkowitz & Donnerstein, 1982). Recall that construct validity, for example, depends upon measures and/or manipulations being able to capture or produce psychological constructs within participants, such as fear, anger, or aggression. For example, a social rejection experience may be quite fabricated and artificial, but if it feels real to participants then causal hypotheses about the effects of feeling rejected can still be evaluated. Similarly, one might be concerned that participants will know they are being studied in a laboratory and therefore act unusually due to being observed. However, this risk can often be managed. Many experiments use deceptive procedures, or between-participant designs that hide the other conditions from participants, to disguise the true purpose of the research. For example, studies of bystander apathy examine how participants respond to emergencies (Latané & Darley, 1970). Although psychologists cannot ethically place people in real emergencies, they can lead participants to believe that they are attending a lab for one purpose, and have a simulated emergency occur, such as a person crying out in pain from an adjacent room. When participants intercede, they believe they are responding to a real emergency disconnected from the experiment, and so concerns about participants “feeling studied” can sometimes be controlled.

Practically, the laboratory offers many important advantages to researchers, such as the ability to control noise variables like time of day, temperature, noise and distractions, and so on. Although a variable like ‘temperature’ may not immediately seem important to a psychologist, note for example that room heat has been associated with aggression (Baron & Bell, 1975). Seemingly irrelevant environmental variables can directly influence psychological processes. Furthermore, lab equipment such as physiological equipment, or computers that can assess reaction time, can be made available in a laboratory with relative ease. However, a disadvantage is that some kinds of experiences are not easily cultivated in a laboratory. For example, although psychologists may study group formation in a lab, it is more difficult to study long-term group identity processes within a single-hour lab study, and impractical to have participants attend a laboratory for the years or decades required for some processes to unfold. Similarly, topics such as serious romantic relationships, bereavement, and so on, may be difficult to emulate in a laboratory and may be better studied in their natural contexts.

Although not overcoming all challenges associated with laboratory studies, one alternative context to the traditional laboratory is to conduct research in an online setting. There are several advantages to this setting. It is relatively easy to solicit large samples of participants, particularly when using crowd-sourcing technologies such as Amazon Mechanical Turk or Crowdflower. Furthermore, very rare (e.g., individuals with low-prevalence conditions) or distal groups (e.g., when an American researcher wishes to study Japanese populations) are much easier to obtain using online research. However, critics have suggested that attention levels may waver online, especially among university participants completing research online (Hauser & Schwarz, 2016). Others have argued that this “online inattention” problem may be obviated with attention checks (Goodman, Cryder, & Cheema, 2013; but see Hauser & Schwarz, 2015). Certainly, online studies tend to involve participants who know that they are being studied, and so the above-noted concerns about presentation biases may be a concern here once again. With respect to the control that psychologists have over respondents’ environments, the answer here is mixed. For example, an online study can request that participants work in a private, uninterrupted work environment, but can rarely enforce this behavior within participants. Similarly, numerous random variables will fluctuate across participants in online samples. Variables such as room temperature, density of people within the room, and background noise, cannot be directly controlled. Additionally, online research may constrain researchers in their choice of measures and manipulations. For example, researchers can have online participants interact socially in web forums or chat rooms, but many aspects of social interaction (e.g., physical presence, nonverbal communication) are hard to capture in online studies. Similarly, some measures (e.g., physiological) may be impossible to obtain in online contexts, again restricting the sort of research that psychologists can pursue in this format.

Finally, some psychologists have argued for the benefits of field research, often protesting the apparent decrease in field studies in recent psychological science (Cialdini, 2009). Field studies do offer some advantages, such as making it typically quite easy to disguise a study’s purpose. For example, field studies in which subtle aspects of an environment are altered, such as changing the signs present in a neighborhood and observing the results, will allow participants to be unaware that they are being studied, and therefore permit an authentic assessment of their reactions. However, a drawback to field research is that, although external behaviors can be easily detected and studied, internal processes such as participants’ private attitudes and emotions to stimuli can be difficult to assess in this setting. Another potential drawback of field research is that many environmental factors that are easy to control in laboratories (e.g., temperature, wind, the presence of passersby) may be much more difficult to standardize and regulate in field settings. Planning and careful attention to such factors can partially mitigate these risks, but the likely increased instability of noise variables in field research can interfere with inference testing.

Different contexts of data collection (in-lab, online, field, etc.) all carry certain advantages and disadvantages. One alternative to selecting one method and accepting all of the relevant drawbacks is to conduct multiple studies using multiple methods. For example, a researcher might begin by testing anger’s relation to aggression using a laboratory experiment, using university students; then perform a similar test using a large sample of online participants who vary more widely across demographic variables; and then conduct a field study in which anger’s relation to aggression is monitored covertly (e.g., in a workplace setting).


Once a study is completed, the final steps in the research process are the analysis of the data, the interpretation of the results, and the report of the findings. In psychological research, the vast majority of studies involve data that are quantitative in nature. Quantitative data refer to information that is expressed in some numerical form. For example, people’s responses to a 7-point rating scale indicating the level of anger they are currently feeling might be represented by whole numbers ranging from 1 to 7. Once the data are collected, the researcher must formulate a statistical analysis of the data that corresponds to the question of interest.

If the goals of the study are purely descriptive in nature, analysis typically involves the computation of descriptive statistics for the measures of interest. Descriptive statistics summarize the overall pattern of responses for a given measure within a sample. The two most common types of descriptive statistics are indices of central tendency (i.e., indices of the single response that best characterizes the sample as a whole; e.g., the average of anger ratings in a sample) and indices of variability (i.e., indices of the extent to which responses are very similar to versus different from one another in the sample; e.g., the range of ratings of anger in a sample).
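As a small illustration, the indices described above can be computed for a set of anger ratings on a 7-point scale. The ratings below are made up for the example; the calculations use Python’s built-in statistics module.

```python
import statistics

# Invented anger ratings from ten participants on a 1-7 scale.
anger_ratings = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]

# Indices of central tendency
mean_rating = statistics.mean(anger_ratings)
median_rating = statistics.median(anger_ratings)

# Indices of variability
rating_range = max(anger_ratings) - min(anger_ratings)
standard_deviation = statistics.stdev(anger_ratings)  # sample standard deviation

print(f"mean = {mean_rating}, median = {median_rating}")
print(f"range = {rating_range}, SD = {standard_deviation:.2f}")
```

The mean and median summarize the typical response, while the range and standard deviation summarize how much individual responses differ from one another.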

However, as noted earlier, most psychological research involves inferential research questions (i.e., questions regarding the relationship between two or more psychological or behavioural constructs). In these cases, a variety of inferential statistics are available to researchers. The specific type of inferential statistic that will be most appropriate for addressing a given research question depends on a number of factors. A detailed discussion of these different types of statistical tests obviously goes well beyond the scope of this chapter. However, in a broad sense, there are several factors that guide a researcher’s choice of statistical tests. First, the nature of the relationship being explored is an important consideration. For example, is the researcher only interested in a relationship between two variables? Alternatively, is the researcher interested in the relationships of several independent variables to a single dependent variable, or perhaps the relationships of multiple independent variables to multiple dependent variables? Second, what is the scale of measurement for the variables to be analyzed? Are they purely nominal-level variables, purely interval level, or a mixture? Finally, what are the distributional properties of the variables? Do scores on the variables reflect a normal distribution? Depending on the answers to these sorts of questions, some types of analyses will be more appropriate than others because they make more or fewer assumptions about these properties of the data.

Although researchers have a vast array of different types of statistical tests from which they can choose, far and away the most commonly used statistical tests are based on the concept of Null Hypothesis Significance Testing (NHST). Simply stated, these tests assess the hypothesis that the relationship of interest does not exist in the population (the null hypothesis). Tests are considered to be statistically significant when they produce a probability value (p) equal to or less than .05. The probability value is the likelihood of obtaining data at least as extreme as those observed if the null hypothesis were true. Statistical significance at the .05 level therefore indicates that results this extreme would be expected no more than 5% of the time by chance alone if no relationship existed in the population. In these cases, the researcher is said to have rejected the null hypothesis (i.e., rejected the hypothesis that the relationship does not exist in the population).

Tests are considered “non-significant” when they produce a probability value (p) greater than .05. That is, a test is considered to have provided insufficient evidence for the existence of a relationship if a relationship at least as strong as the one observed would emerge more than 5% of the time simply due to chance when the null hypothesis is true. In such cases, the researcher is said to have “failed to reject the null hypothesis”.
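One way to see what a probability value represents is a permutation test, sketched below. This is not the test most researchers would report; it is swapped in here only because its logic is easy to follow. The aggression scores are invented, and the function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often a group difference at least as large as the observed
    one would arise if group labels were unrelated to the scores (the null)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate from being exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)

# Invented aggression scores after a frightening versus a neutral film clip.
fear_condition = [6, 7, 5, 8, 7, 6, 8, 7]
control_condition = [3, 4, 2, 4, 3, 5, 3, 4]

p = permutation_p_value(fear_condition, control_condition)
print("significant at .05" if p <= 0.05 else "non-significant", f"(p = {p:.4f})")
```

A small p here means that random relabelling of participants almost never produces a difference as large as the one observed, which is the sense in which the null hypothesis is rejected.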

When an analysis of a study has produced an accurate conclusion regarding the existence of a relationship between variables, the study is said to be high in statistical conclusion validity (see Cook & Campbell, 1979; Shadish et al., 2002). Conceptually, there are two forms of errors that a researcher can make with a statistical test, thereby leading to low statistical conclusion validity. A Type I error is when a researcher falsely concludes that a relationship exists (i.e., incorrectly rejects the null hypothesis). Traditionally, researchers have considered this form of error to be very serious and set their level of risk for making such an error in their statistical tests (referred to as the alpha level) at .05. Recently, some researchers have called for even stricter alpha levels as a means of enhancing the statistical conclusion validity of psychological research (e.g., Benjamin et al., 2017). A Type II error is when a researcher falsely concludes that there is no evidence for the existence of a relationship (i.e., incorrectly fails to reject the null hypothesis). Although researchers have traditionally placed less emphasis on this form of error, they have still considered it problematic and have typically set their level of risk for making such an error in their statistical tests (referred to as beta) at .20. This means that researchers try to collect enough data that the risk of mistakenly concluding that no relationship exists (when a relationship actually does exist) is no greater than 20%.
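The two error rates can be illustrated by simulating many hypothetical studies. The sketch below makes several simplifying assumptions that are not part of the chapter: scores are normally distributed with a known standard deviation of 1, a two-group z-test is used, there are 64 participants per group, and the true effect (when one exists) is half a standard deviation.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided p-value from a z-test for two equal-sized groups with known SD."""
    n = len(group_a)
    z = (sum(group_a) / n - sum(group_b) / n) / (sd * sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_difference, n_per_group, n_studies=2000, alpha=0.05, seed=0):
    """Proportion of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        treatment = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if two_sided_p(treatment, control) <= alpha:
            rejections += 1
    return rejections / n_studies

# No true effect: every rejection is a Type I error (should be near alpha = .05).
type_one_rate = rejection_rate(true_difference=0.0, n_per_group=64)

# True effect of 0.5 SD: every failure to reject is a Type II error.
power = rejection_rate(true_difference=0.5, n_per_group=64, seed=1)
type_two_rate = 1 - power

print(f"Type I error rate: {type_one_rate:.3f}")
print(f"Power: {power:.3f}, Type II error rate: {type_two_rate:.3f}")
```

Under these assumptions the Type II error rate comes out at roughly the conventional .20; with fewer participants per group it would be higher.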

Methodologists have identified a number of potential threats to the statistical conclusion validity of research (e.g., see Cook & Campbell, 1979; Shadish et al., 2002). For example, the validity of a statistical test can be undermined if the underlying assumptions of the test are violated. Many tests assume that interval or ratio-level measures follow a normal distribution. Other tests assume that the observations comprising the sample are independent of one another (e.g., that the responses provided by one person in the sample are not in any way related to the responses provided by another person in the sample). Researchers may sometimes remedy such problems by selecting a statistical test with less stringent assumptions.

Other threats to statistical conclusion validity reflect more fundamental and sometimes perhaps even more intentional errors on the part of researchers. Concerns regarding these sorts of errors have received a great deal of attention in recent years and have led some researchers to call for major changes in the way psychological research is conducted (Lilienfeld, 2017; Lilienfeld & Waldman, 2017). One issue of concern has been the fact that many studies conducted in psychology have insufficient statistical power. Statistical power refers to the probability that a study will correctly reject the null hypothesis when a relationship truly exists. Traditionally, statistical power has primarily been a concern with respect to Type II error (e.g., Cohen, 1988). However, recently methodologists have noted that in the context of a single study, because studies with low power tend to be more likely to produce anomalous results, low power can sometimes also lead to Type I errors (e.g., Button & Munafo, 2017).

Another issue that has generated a great deal of interest is a set of practices known as QRPs (Questionable Research Practices: see John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011). QRPs cover a wide range of data collection, analysis, and reporting practices, most of which are considered problematic because they can undermine the statistical conclusion validity of a study. Some of these practices involve incomplete reporting of results. For example, a researcher might conduct analyses on multiple dependent variables but then report only the results for the dependent variables that produce significant effects. Likewise, a researcher might conduct multiple different types of analyses on a single dependent variable and report only the analysis that produces a significant effect. Similarly, a researcher might conduct a study involving multiple experimental conditions, but then only report the results for those conditions that produce significant differences. Alternatively, a researcher might conduct multiple studies and then only report those studies that produce a significant effect.
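The cost of testing several outcomes and reporting only the significant ones can be illustrated with a short calculation. This sketch assumes that the tests are statistically independent and that no true effect exists for any of them. Real dependent variables are usually correlated, so the numbers are only a rough guide.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical numbers of dependent variables analyzed in one study.
for n_tests in (1, 3, 5, 10):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: chance of at least one false positive = {rate:.3f}")
```

With a single test the risk equals the alpha level, but it grows quickly as more tests are run. This is why selectively reporting the one significant result out of many misrepresents the actual Type I error risk.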

Other practices involve changes to the data set itself or the manner in which it is analyzed. For example, a researcher might decide to drop participants from a data set based on whether their deletion strengthens the key effects of interest in a study. Alternatively, a researcher might gradually add participants to an existing data set and base their decision to stop adding participants solely on when the addition of participants produces a significant effect.
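The practice of adding participants until a result becomes significant can be examined with a simulation. The sketch below is an illustration under assumed conditions rather than a description of any particular study: there is no true effect, scores have a known standard deviation of 1, a one-sample z test is used, and the batch size and maximum sample size are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(optional_stopping, batch_size=10, max_n=50,
                        alpha=0.05, n_studies=2000, seed=7):
    """Share of null-effect studies that end with a significant result."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    significant_studies = 0
    for _ in range(n_studies):
        running_sum, n = 0.0, 0
        while n < max_n:
            # Add one more batch of participants; the true mean is 0.
            running_sum += sum(rng.gauss(0.0, 1.0) for _ in range(batch_size))
            n += batch_size
            # With optional stopping the data are tested after every batch;
            # otherwise they are tested once, at the planned sample size.
            if optional_stopping or n >= max_n:
                z = running_sum / n ** 0.5  # one-sample z test, known SD = 1
                if abs(z) > critical_z:
                    significant_studies += 1
                    break
    return significant_studies / n_studies


fixed_rate = false_positive_rate(optional_stopping=False)
peeking_rate = false_positive_rate(optional_stopping=True)
print(f"Test once at the planned sample size: {fixed_rate:.3f}")
print(f"Test after every batch, stop if p < .05: {peeking_rate:.3f}")
```

Testing once at the planned sample size keeps the false positive rate near the alpha level. Testing after every batch and stopping at the first significant result gives chance several opportunities to produce a significant effect, so the false positive rate rises well above .05.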

As these examples illustrate, many QRPs are practices that are intended to produce a significant effect, without any clear justification beyond the fact that they produce a desired outcome for the researchers, who may be motivated to identify a significant effect. As such, these practices can inflate Type I error rates. Indeed, although each practice can potentially undermine statistical conclusion validity on its own, the risk is even greater when several of these practices are performed in conjunction with one another (Simmons et al., 2011).

In short, numerous issues have been raised about how psychologists conduct aspects of research, and they are often accompanied by guidelines for improving the statistical validity of research. However, other commentators have suggested that there are problems inherent in NHST as a scientific tool and that no set of reforms to current practices will ultimately be successful in addressing these limitations. These commentators have argued that alternative statistical approaches are required. For instance, some have proposed that traditional statistical tests be abandoned in favour of reporting effect sizes and their corresponding confidence intervals (e.g., Cumming, 2014; Schmidt, 1996). Others have advocated use of Bayesian statistics (e.g., Wagenmakers et al., 2017). Space restrictions preclude a discussion of these alternatives and to date neither has gained widespread acceptance in psychology. However, psychologists continue to debate their potential advantages and disadvantages. This remains an important area of research for experimental and quantitative psychologists, as well as for individuals who interpret research and make policy recommendations based on findings that rest fundamentally on statistical inferences. Practitioners in the field should remain familiar with developments in this evolving area as these decisions have important implications for the application of existing research.


The previous sections have primarily explained social science methodology with the goal of maximizing the reliability and validity of research findings. However, psychologists must balance their interests in obtaining reliable, valid results with several important ethical guidelines that establish how research should be conducted. Indeed, one can imagine scientific studies that could be highly reliable and valid, yet ethically egregious. For example, if a researcher was interested in the effects of socioeconomic status on aggressive behaviour, it would be methodologically sound to randomly assign children at birth to adopting parents who are poor or wealthy. However, such a study would obviously be considered ethically problematic.

Although researchers have debated and discussed many aspects of research ethics for decades and specific guidelines and procedures vary as a function of locales and disciplines, three fundamental principles of research ethics tend to be emphasized in nearly all systems. These three core principles are giving participants information sufficient to allow informed choices about whether to participate, minimizing harm to participants, and maintaining the privacy of participants’ responses.

Informed consent is the principle that participants should have a reasonable understanding of what they will be expected to do in a study, and the likely benefits/harms that may affect them. For example, participants should know if research may cause them harm (including physical, emotional, financial/professional, interpersonal, or yet other kinds of harm), and how much of their time is being requested as participants. Additionally, participants should be informed in advance about issues including whether their data will be confidential and/or anonymized (see below), or whether information about them will be obtained from sources other than themselves (e.g., from their academic transcripts). The point is that participants’ consent to participate in research is only meaningful if they know what they are being asked to do.

One potential challenge to informed consent is the fact that some psychological questions are best pursued by partially or fully misleading participants about aspects of the research. For example, when researchers wish to covertly monitor participants’ aggressive behaviours, it may undermine the unobtrusive nature of this measurement if participants know they are being watched. Similarly, some indirect measures rely on participants not being aware of what is being measured and may be undermined if participants realize what the measure is assessing. In other cases, participants are given false information about society, the actions of other participants in the experiment, the purpose of a study (often provided as a cover story in which researchers create a fictitious purpose of the research), or about the participant themselves (e.g., falsely informing participants that they have poor intelligence).

Deception is sometimes considered acceptable provided it is necessary to effectively study the question of interest and a debriefing document or other method is used to inform participants at the end of a study. A debrief will often contain several elements, such as an explanation of what the truth is (e.g., what the real purpose of a study was), and why deception was considered necessary. Because this new information may alter participants’ willingness to have participated, in some contexts it may be appropriate to give participants a second opportunity to consent to the research study. For example, returning to the example of covert monitoring of aggressive behaviour, a researcher might reveal this covert monitoring at the study’s end, and offer to delete the recording if the participant does not consent to the researcher keeping this data. After all, they originally consented without knowing that such data was to be collected. The lack of initial disclosure may be necessary because the monitoring would not be covert if participants were warned about it when they first consented.

A second principle is the minimization of harm. That is, participants’ exposure to loss, pain, and/or damage should be reduced as much as possible. Some studies may necessitate some degree of harm, such as when participants are given painful shocks to elicit anger (e.g., Berkowitz & LePage, 1967). Minimization of harm would here involve careful scaling of the shock: it must be painful (enough to elicit anger), but no more painful than that (to minimize participants’ suffering). When possible, researchers should highlight ways in which participation can serve as a growth opportunity, such as a chance to better understand themselves, rather than as harmful. In addition to the ethical benefit, this also has a practical benefit: participants who see research and researchers in more positive terms are presumably more likely to understand the importance and value of research in psychological science.

Turning back to the recurring example, a researcher who wishes to induce fear in a participant should aim to have participants experience fear only for as long as is necessary to test a research question. Fear is usually considered a negative, uncomfortable emotion, so while researchers can ethically study fear they should also try to respect participants’ needs. For example, researchers may end the study with a positive emotion induction (Westermann, Spies, Stahl, & Hesse, 1996) to reverse the harm. Also, consider deception in the context of minimizing harm. We previously highlighted the potential issue of deception with informed consent, but there is also a risk of deception causing harm: participants may feel foolish for ‘falling for’ a deceptive manipulation. Thus, it may be advisable to remind participants that most experiments find only tiny suspicion rates: almost everybody ‘falls for it’, so participants should not feel embarrassed. It is possible that some deception could introduce other harms, such as leaving participants with inaccurate information about their having health problems. Sometimes, researchers may provide true information in the debriefing form, such as providing real statistics about social facts when false facts were provided in the experiment, or reminding undergraduate participants that the average undergraduate student has high intelligence when they were falsely told that they lacked intelligence. The goal is to offset the harm incurred by the false information.

A third important principle is the privacy of participant data. Two aspects of participant data privacy are anonymity (i.e., the degree to which participants’ identifying information is disassociated from their study data), and confidentiality (i.e., whether researchers keep participants’ identifying information to themselves). Where possible, it is usually advisable to maintain the anonymity of participants’ data by disassociating participants’ identifying information (e.g., name, email address) from their response data. This may have several advantages, such as protecting participants’ privacy rights. It also permits researchers to share data with others without having to compromise participants’ privacy. In some cases, it is necessary for data to be non-anonymous at least temporarily, such as when a researcher tracks a sample of participants across multiple time points and wishes to correlate participants’ responses across time. In longitudinal research, this could mean that data is identifiable for decades! However, once data collection has been completed, it is normally possible to anonymize data afterwards, stripping data of this identifying information.
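One way to picture the separation of identifying information from response data is the sketch below. It is a hypothetical illustration, not a prescribed procedure: the function name, field names, and records are invented for the example, and real projects should follow the data-handling requirements of their ethics board.

```python
import secrets


def split_identifiers(records, identifying_fields=("name", "email")):
    """Separate identifying fields from response data.

    Returns (response_rows, linking_key). Response rows carry only a random
    participant code. The linking key maps each code back to the identifiers,
    so it must be stored separately and kept confidential.
    """
    response_rows, linking_key = [], {}
    for record in records:
        code = secrets.token_hex(8)
        while code in linking_key:  # guard against an unlikely duplicate code
            code = secrets.token_hex(8)
        linking_key[code] = {
            field: record[field]
            for field in identifying_fields if field in record
        }
        row = {
            field: value
            for field, value in record.items()
            if field not in identifying_fields
        }
        row["participant_code"] = code
        response_rows.append(row)
    return response_rows, linking_key


# Hypothetical records, invented purely for illustration.
raw_records = [
    {"name": "Participant A", "email": "a@example.org", "aggression_score": 4},
    {"name": "Participant B", "email": "b@example.org", "aggression_score": 2},
]
responses, key = split_identifiers(raw_records)
print(responses)
```

While the linking key exists, the data are confidential but not anonymous, which is what allows responses to be matched across time points. Destroying the key once data collection is complete leaves only the coded response rows.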

Typically, even non-anonymous data should be confidential, meaning that a researcher would not share any identifier-data associations with others, even if the researcher can personally associate identifiers with data. In summary, the general principle of participant privacy is that privacy should be maintained as far as logistically possible. Tying this back to consent, in cases where confidentiality would not be possible to extend to participants, those participants should at least know what their expectations of privacy should be, preferably when they initially provide consent.


Psychological research spans many diverse topics and interests, but the fundamental, conceptual steps required to create high-quality research are in many ways similar. This chapter has focused on delineating research questions, selecting dependent and independent variables, issues involving the setting and population, data analysis, and ethics. However, entire chapters and articles have been devoted to in-depth explorations of each of these individual topics (and others); we have provided many references to example articles and chapters throughout. Importantly, we hope this chapter highlights often under-recognized skills that are developed through training in psychological science. Undergraduate programs in psychological science should prepare students to effectively evaluate research methodological issues including sample size, risks associated with third variables, whether questionable research practices were likely to have been present, whether rigorous ethical safeguards were in place, whether appropriate statistical tests were used, and whether researcher conclusions are consistent with the results from statistical tests based on the methodology employed. These are all skills that are valued beyond academia: from evaluating research for policy development to interpreting survey data gathered in an applied setting, professionals who display thoughtful and critical consideration of the quality of evidence are highly sought after.

Key Words and Concepts

Alpha level: The level of risk for making Type I errors within Null Hypothesis Significance Testing

Anonymity: The degree to which participants’ identifying information is disassociated from their study data

Applied research: Applied research questions tend to focus on a specific problem. They typically emphasize predicting or influencing an outcome rather than understanding why that outcome is predicted or influenced by a given factor

Basic research: Basic research is aimed at formulating and testing fundamental psychological principles governing a domain of interest

Between-participant design: This research design examines causal relationships by randomly assigning people to only one of two or more conditions and examining differences emerging between the groups

Confederate: Someone who appears to be a participant in a research study to other participants, but who is actually part of the research team playing a role to create desired research conditions

Confidentiality: Whether researchers keep participants’ identifying information to themselves

Confirmatory research: When the researcher specifies and tests what factors are likely to cause an effect, and perhaps even when and why such factors have their effects

Constructs: Those elements in a study thought to vary across people and/or situations.

Construct validity: The degree to which a measure specifically and sensitively captures its intended construct

Context: The population of interest (people) and the location and time (setting) in which research takes place

Control group: An experimental group that receives a treatment that is not expected to influence the variables of interest, but that typically simulates other aspects of the experimental design. Control groups serve as a base-line comparison for the treatment groups

Convenience sample: A sample that is not randomly selected, but instead is obtained from a more readily available subgroup of the population

Correlational Research: A research paradigm that lacks random assignment to condition and/or experimental manipulations of variables. As a result, causal conclusions are less tenable with this type of methodology

Counterbalancing: An experimental method where the order of treatments within an experiment is intentionally varied across participants to reduce the risk of treatment order influencing the results. Thus, counterbalancing reduces a third variable concern of treatment order

Covary: The extent to which variables increase and/or decrease in similar patterns

Criterion validity: A specific type of construct validity: the extent to which a measure is associated with other measures that should logically be related to its construct

Debriefing document/debriefing: When participants are fully informed of the research design and purpose at the conclusion of the research study. This may be done through a document, or through a discussion with a researcher

Deception: When, in order to ensure participants respond as naturally as possible within a study, participants are not given a complete understanding of the research. This can occur through incomplete details being provided, or through participants being actively misled by the experimenter(s). If deception is approved for use by the reviewing ethics committee, a debriefing document and post-study consent are typically required

Demand characteristics: When participants act, behave, or report in a certain manner, due to their perceptions of what is desired of them, or perceived pressures from the experimenter

Dependent variables: Variables that are thought to be influenced by the independent variables

Descriptive research questions: These research questions typically focus on simply describing one or more psychological or behavioural constructs in a given domain of interest

Descriptive statistics: A numerical summary of the overall pattern of responses for a given measure within a sample. Typically descriptive statistics include indices of central tendency and variability

Direct measures: Measures where participants self-report on questions being asked of them. Participants are aware of the measure, and respond to that measure directly

Discriminant validity: A specific type of construct validity: when a measure shows minimal associations with irrelevant variables

Exclusion criteria: Characteristics that would render a participant ineligible to participate in a research study

Explicit measures: Measures that assess relatively controlled and deliberative types of participants’ responses

External validity: The degree to which study results can be extended to populations other than the research sample studied

Experiment: A research methodology where participants are randomly assigned to conditions, and the researcher manipulates at least one independent variable to test the influence of the specified independent variable(s) on a dependent variable. Cause-and-effect conclusions are facilitated by using an experimental methodology

Exploratory research: Research that is undertaken when researchers do not have specific expectations, but rather more general notions regarding the relationships among the constructs of interest

Face validity: A specific type of construct validity: when a measure appears to reflect its construct according to either experts or laypeople

Field research: Studies in which subtle aspects of an environment are altered and participants are unaware that they are being studied, therefore permitting an authentic assessment of participant reactions

Filler measures: Scales that researchers do not wish to evaluate that are included to confuse participants’ understanding of the study’s purpose

Funnel interview: Participants are asked increasingly probing questions about their experiences in the study and what they thought the study’s purpose was

Generalizability: The degree to which study results can be extended to populations other than the research sample studied

Hypotheses: Researcher expectations regarding patterns of relationships among variables that are specified in advance, and formally tested in research

Implicit measures: Measures that assess relatively uncontrolled and automatic types of participants’ responses

Inclusion criteria: Characteristics that a participant must display to be eligible to participate in a research study

Incremental validity: The concept that using two or more types of measures to predict behaviour is more powerful than using only one type of measure

Independent variables: Variables that are thought to influence the dependent variable(s)

Indirect measures: Measures that assess participants on the construct of interest without directly asking participants to provide self-assessment of their psychological attributes

Inferential research: The exploration of relations among psychological and behavioural constructs.

Informed consent: The ethical principle that participants should have a reasonable understanding of what they will be expected to do in a study, and the likely benefits/harms that may affect them

Inter-rater reliability: The extent to which independent evaluators judge something in a convergent manner

Internal validity: Researchers’ ability to make causal claims about the relationship between study variables

Interval data: Data based on scale response options that are equally spaced

Manipulation check: A measure, other than the dependent variable, to assess whether a manipulation had the desired effect

Manipulations: Variables that are deliberately chosen and changed so as to influence the dependent variables of interest

Measurement invariance analysis: A mathematical procedure to establish whether items of a measure perform similarly across groups at a psychometric level

Minimization of harm: The ethical responsibility of all researchers to reduce participants’ exposure to loss, pain, and/or damage as much as possible

Mundane realism: The degree to which an experiment applies to “real world” situations

Nominal scales: Any measure for which scores are given as categorical labels

Null Hypothesis Significance Testing (NHST): Commonly used statistical tests that test the hypothesis that the relationship of interest does not exist in the population against a collected sample of data

Observational measures: These measures allow social scientists to obtain information from their subjects through evaluating participants’ overt behaviours

Operationalizing: The process of deciding how to go about measuring the defined constructs with a specific measure

Ordinal scales: Scales that provide a rank ordering of the data

Physiological measures: Measurement of physical body responses including, but not limited to, heart rate, blood pressure, neuron activity, and galvanic skin response

Population of interest: Typically a very large group of people about whom the researcher wishes to draw conclusions

Power analyses: Mathematical techniques to determine an appropriate sample size based on the experimental design and desired statistical power

Prime: A stimulus used to activate a word or concept in a participant’s mind, either with (supraliminal) or without (subliminal) the participant’s being consciously aware of it

Psychological research methods: The principles and procedures that guide psychologists’ exploration of research questions

Psychometric: Relating to the evaluation of the quality of psychological measurement, such as through assessment of measures’ structure or validity

Quantitative data: Information that is expressed in some numerical form

Questionable Research Practices (QRPs): QRPs cover a wide range of data collection, analysis, and reporting practices, most of which are considered problematic because they can undermine the statistical conclusion validity of a study. These include, but are not limited to, selective reporting of research findings, and failure to report data manipulations. These practices can often inflate Type I error rates.

Random Assignment: An experimental feature where every participant has an equal likelihood of being placed in any of the experimental conditions

Random sample: A subset of the population of interest that is selected to participate in a research study in such a way that ensures that every member of the population under investigation has an equal probability of being included in the sample

Ratio data: Data collected based on response options that are equally spaced, and additionally include a true zero point

Reliability: The consistency with which a measure provides the same information

Reverse causation: The possibility that a variable purported to be the cause of another variable is actually its consequence.

Sample: A subset of the population of interest that is selected to participate in a research study

Sample size: The number of observations (e.g., participants) collected for a study. The number of observations in a sample must be large enough to make valid conclusions using the chosen statistical techniques

Self-report measurement: Measures where participants are directly asked to report their standing on the psychological or behavioural construct of interest, typically using some form of structured rating scale

Socially desirable responding: The tendency for respondents to distort their responses in order to present themselves favourably

Statistical conclusion validity: The degree to which an analysis of a study has produced an accurate conclusion regarding the existence of a relationship between variables

Statistical power: The ability to detect inferential patterns between variables with statistics where the patterns truly exist

Test-retest reliability: Consistency of responses across multiple time points, obtained using the same respondents and same measure

Third variable problem: When establishing causation between variables, the possibility that an unaccounted-for variable is the true cause of their association

Transparency: The degree to which participants can understand the true purpose of a study

Type I error: When a researcher concludes that there is a statistically significant relationship between variables of interest based on the null hypothesis significance test, but this statistical finding is inaccurate because in reality there is no such relationship

Type II error: When a researcher concludes that there is not a statistically significant relationship between variables of interest based on the null hypothesis significance test, but this statistical finding is inaccurate because in reality there is such a relationship

Validity: The degree to which a measure accurately quantifies what it intends to measure

Within-participant design: Experimental design in which participants each undergo every treatment condition



Archer, J., & Coyne, S. M. (2005). An integrated review of indirect, relational, and social aggression. Personality and Social Psychology Review, 9(3), 212-230.

Asendorpf, J. B., Banse, R., & Mucke, D. (2002). Double dissociation between explicit and implicit personality self-concept: The case of shy behavior. Journal of Personality and Social Psychology, 83(2), 380-393.

Baker, R. C., & Guttfreund, D. O. (1993). The effects of written autobiographical recollection induction procedures on mood. Journal of Clinical Psychology, 49(4), 563-568.

Banerjee, A., & Chaudhury, S. (2010). Statistics without tears: Populations and samples. Industrial Psychiatry Journal, 19(1), 60-65.

Baron, R. A., & Bell, P. A. (1975). Aggression and heat: Mediating effects of prior provocation and exposure to an aggressive model. Journal of Personality and Social Psychology, 31(5), 825-832.

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory-II. San Antonio.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10.

Berkowitz, L., & Donnerstein, E. (1982). External validity is more than skin deep: Some answers to criticisms of laboratory experiments. American Psychologist, 37(3), 245-257.

Berkowitz, L., & LePage, A. (1967). Weapons as aggression-eliciting stimuli. Journal of Personality and Social Psychology, 7(2), 202-207.

Button, K. S., & Munafò, M. R. (2017). Powering reproducible research. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 22-33). Hoboken, NJ: John Wiley & Sons.

Cacioppo, J. T., Berntson, G. G., Lorig, T. S., Norris, C. J., Rickett, E., & Nusbaum, H. (2003). Just because you’re imaging the brain doesn’t mean you can stop using your head: A primer and set of first principles. Journal of Personality and Social Psychology, 85(4), 650-661.

Cacioppo, J. T., & Tassinary, L. G. (1990). Inferring psychological significance from physiological signals. American Psychologist, 45(1), 16-28.

Cialdini, R. B. (2009). We have to break up. Perspectives on Psychological Science, 4(1), 5-6.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand McNally College Publishing Company.

Costa, P. T., Jr., & McCrae, R. R. (1991). NEO five-factor inventory (NEO-FFI), Form S (Adult). Lutz, FL: Psychological Assessment Resources.

Costa, P. T., & McCrae, R. R. (1993). Psychological research in the Baltimore longitudinal study of aging. Zeitschrift fur Gerontologie, 26(3), 138-141.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

De Houwer, J. (2006). What are implicit measures and why are we using them? In R. W. Wiers & A. W. Stacy (Eds.), The handbook of implicit cognition and addiction (pp. 11-28). Thousand Oaks, CA: Sage.

Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The long way from α-error control to validity proper: Problems with a short-sighted false-positive debate. Perspectives on Psychological Science, 7(6), 661-669.

Field, A. (2019). Discovering Statistics [Homepage]. Retrieved from https://www.discoveringstatistics.com/

Friese, M., Hofmann, W., & Schmitt, M. (2008). When and why do implicit measures predict behaviour? Empirical evidence for the moderating role of opportunity, motivation, and process reliance. European Review of Social Psychology, 19(1), 285-338.

Gawronski, B., & De Houwer, J. (2014). Implicit measures in social and personality psychology. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd edition). New York, NY: Cambridge University Press.

Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213-224.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6), 1464-1480.

Harpe, S. E. (2015). How to analyze Likert and other rating scale data. Currents in Pharmacy Teaching and Learning, 7(6), 836-850.

Hauser, D. J., & Schwarz, N. (2015). It’s a trap! Instructional manipulation checks prompt systematic thinking on “tricky” tasks. Sage Open, 5(2). doi:10.1177/2158244015584617

Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48(1), 400-407.

Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34(3), 177-189.

Holden, R. R., & Jackson, D. N. (1979). Item subtlety and face validity in personality assessment. Journal of Consulting and Clinical Psychology, 47(3), 459-468.

Ilgen, D. R., & Favero, J. L. (1985). Limits in generalization from psychological research to performance appraisal processes. Academy of Management Review, 10(2), 311-321.

Iyer, R. (2019). YourMorals.org [Homepage]. Retrieved from https://www.yourmorals.org/index.php

John, O. P., & Benet-Martinez, V. (2014). Measurement: Reliability, construct validation, and scale construction. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd ed., pp. 339-369). New York, NY: Cambridge University Press.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. doi:10.1177/0956797611430953

Klauer, K. C., & Teige-Mocigemba, S. (2007). Controllability and resource dependence in automatic evaluation. Journal of Experimental Social Psychology, 43(4), 648-655.

Lakens, D. (2019). Improving your statistical inferences [Online Course]. Retrieved from https://www.coursera.org/learn/statistical-inferences

Latané, B., & Darley, J. M. (1970). The unresponsive bystander: Why doesn’t he help? Century Psychology Series. New York, NY: Appleton-Century-Crofts.

LeBel, E. P., & Gawronski, B. (2009). How to find what’s in a name: Scrutinizing the optimality of five scoring algorithms for the name‐letter task. European Journal of Personality, 23(2), 85-106.

Lieberman, J. D., Solomon, S., Greenberg, J., & McGregor, H. A. (1999). A hot new way to measure aggression: Hot sauce allocation. Aggressive Behavior, 25(5), 331-348.

Lilienfeld, S. O. (2017). Psychology’s replication crisis and the grant culture: Righting the ship. Perspectives on Psychological Science, 12(4), 660-664.

Lilienfeld, S. O., & Waldman, I. D. (Eds.). (2017). Psychological science under scrutiny: Recent challenges and proposed solutions. Hoboken, NJ: John Wiley & Sons.

Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions (pp. 131-152). Mahwah, NJ: Lawrence Erlbaum Associates.

Moses, S. N., Houck, J. M., Martin, T., Hanlon, F. M., Ryan, J. D., Thoma, R. J., … & Tesche, C. D. (2007). Dynamic neural activity recorded from human amygdala during fear conditioning using magnetoencephalography. Brain Research Bulletin, 71(5), 452-460.

Nisbett, R. E., & Cohen, D. (1996). Culture of honor: The psychology of violence in the South. Boulder, CO: Westview Press.

Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3), 231-259.

Nuttin, J. M., Jr. (1985). Narcissism beyond Gestalt and awareness: The name letter effect. European Journal of Social Psychology, 15(3), 353-361.

Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17(11), 776-783.

Orpinas, P., & Frankowski, R. (2001). The Aggression Scale: A self-report measure of aggressive behavior for young adolescents. The Journal of Early Adolescence, 21(1), 50-67.

Page, M. M., & Scheidt, R. J. (1971). The elusive weapons effect: Demand awareness, evaluation apprehension, and slightly sophisticated subjects. Journal of Personality and Social Psychology, 20(3), 304-318.

Paulhus, D. (1991). Measurement and control of response bias. In J. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes, Vol. 1. (pp. 17–59). New York, NY: Academic Press.

Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89(3), 277-293.

Petty, R. E., Fazio, R. H., & Briñol, P. (2012). Attitudes: Insights from the new implicit measures. New York, NY: Psychology Press.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Belmont, CA: Wadsworth Cengage Learning.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2(1), 414-433.

Strack, F., Martin, L. L., & Schwarz, N. (1988). Priming and communication: Social determinants of information use in judgments of life satisfaction. European Journal of Social Psychology, 18(5), 429–442.

Sweetland, A. (1972). Comparing random with non-random sampling methods. Oxford, England: Rand Corp.

Wagenmakers, E. J., Verhagen, J., Ly, A., Matzke, D., Steingroever, H., Rouder, J. N., & Morey, R. D. (2017). The need for Bayesian hypothesis testing in psychological science. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 123-138). Hoboken, NJ: John Wiley & Sons.

Westermann, R., Spies, K., Stahl, G., & Hesse, F. W. (1996). Relative effectiveness and validity of mood induction procedures: A meta‐analysis. European Journal of Social Psychology, 26(4), 557-580.

Widaman, K. F., & Grimm, K. J. (2014). Advanced psychometrics: Confirmatory factor analysis, item response theory, and the study of measurement invariance. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd ed.). New York, NY: Cambridge University Press.


Please reference this chapter as:

Vaughan-Johnston, T. I., Fabrigar, L. R., & Lawrence, K. (2019). Research methods in the psychological sciences. In M. E. Norris (Ed.), The Canadian Handbook for Careers in Psychological Science. Kingston, ON: eCampus Ontario. Licensed under CC BY NC 4.0. Retrieved from https://ecampusontario.pressbooks.pub/psychologycareers/chapter/researchmethods/



The Canadian Handbook for Careers in Psychological Science Copyright © 2019 by The Authors is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.