Nine bets on replicability that you will win everywhere

If you had to bet on a psychological effect replicating, what effect would you bet on?

Though it seems like an unlikely bet to be asked to make, it’s a reality for anyone conducting a psychology class demo. You want an effect that holds up no matter who your students are. You don’t want an effect that disappears at certain times of the day, or in certain classrooms. Since you’re not sure what classes students have taken before, what demos they’ve seen, and what experiments they’ve been in, you want an effect that replicates even for non-naïve participants (i.e., participants who have done the experimental task before).

When choosing a demo for a psychology research methods class, the advice I got from a seasoned lecturer was to always bet on the classic Deese–Roediger–McDermott false memory task. This task was a failsafe workhorse, and its intended effect would come through no matter how uninterested the students seemed, no matter what time the class was held, no matter what the classroom was like.

In the class demo version of this task, students listened to lists of words, and then recalled as many words as they could from each list. Some of the lists were semantically related. For example, one list went something like this: sour, candy, sugar, bitter, good, taste, soda, chocolate, honey, etc. Even though they were told to only write down words they were reasonably sure they had heard, a good proportion of the students recalled hearing the word “sweet,” though it was not in the list.
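Tallying the demo comes down to counting how many students "recalled" the critical lure. A minimal sketch in Python, with invented student responses (the recall sheets and the resulting rate below are hypothetical, for illustration only):

```python
# Score a DRM-style class demo: count how many students recalled the
# critical lure ("sweet"), which was never actually presented.
presented = {"sour", "candy", "sugar", "bitter", "good",
             "taste", "soda", "chocolate", "honey"}
critical_lure = "sweet"

# Hypothetical recall sheets from four students.
recalls = [
    {"sour", "candy", "sweet", "honey"},
    {"sugar", "taste", "sweet"},
    {"bitter", "soda", "chocolate"},
    {"candy", "sweet", "sugar", "good"},
]

false_recall_rate = sum(critical_lure in r for r in recalls) / len(recalls)
print(f"False recall of '{critical_lure}': {false_recall_rate:.0%}")
# → False recall of 'sweet': 75%
```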

The false memory effect – remembering a word that was never presented but is semantically related to all the words in a list – held. Bet won. Psychology instructor: 1, failure to replicate: 0.

What effects replicate, and why?

The reproducibility crisis in science, and especially in psychology, has garnered a lot of attention and thought. A large-scale effort to replicate 100 published studies found that fewer than half of the cognitive and social psychology findings held up.

But the picture was not as bleak across the entire field. Certain effects seemed to replicate better than others. Cognitive psychology effects fared better (50% replicated) than social psychology effects (25% replicated).

Is it the case that cognitive psychology effects, especially ones that rely on within-participant comparisons, are more robust and more likely to hold up under a variety of conditions?

In a recent article in the Psychonomic Bulletin & Review, researcher Rolf Zwaan and colleagues tested nine widely used tasks from three subfields of cognitive psychology to examine whether the tasks’ effects held up under conditions that might be expected to decrease the likelihood of reproducibility – namely, in online environments and when participants completed the same task multiple times.

The researchers selected three tasks each from three domains in cognitive psychology — perception/action, memory, and language. The tasks chosen were ones thought to be robust, the workhorses of the field:

  • (1) Perception/action: Simon task. Key effect: responses are faster when a target is spatially compatible with the response (e.g., target and response are both on the left) than when they are incompatible (the target is on the left and the response is on the right).
  • (2) Perception/action: Flanker task. Key effect: Responses are faster when distractors flanking a central target are compatible (AAAAA) than when they are incompatible (AAEAA).
  • (3) Perception/action: Motor priming. Key effect: Responses to stimuli (<<) are faster when primed by compatible items (<<) than incompatible items (>>).
  • (4) Memory: Spacing effect. Key effect: Recall of words is better when word repetitions are spaced than massed.
  • (5) Memory: False memories (described above). Key effect: Words that are semantically related to words in a list are falsely recognized as presented before.
  • (6) Memory: Serial position. Key effect: Memory recall is better for items presented at the start or end of a list than for items in the middle.
  • (7) Language: Associative priming. Key effect: Responses to a target are faster when the target is preceded by a related prime than when preceded by an unrelated prime.
  • (8) Language: Repetition priming. Key effect: Responses to an item are faster when the item is repeated than when the item is new.
  • (9) Language: Shape simulation. Key effect: Responses to a picture are faster when the picture matches the shape implied in the sentence preceding the picture than when it does not match.
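All nine key effects are within-participant contrasts: each is a difference between two conditions measured in the same person. As a minimal sketch with made-up reaction times (none of these values come from the paper), the Simon effect for a single participant might be computed like this:

```python
from statistics import mean

# Hypothetical single-participant reaction times (RTs) in milliseconds.
compatible_rts = [412, 398, 430, 405, 420]    # target and response on the same side
incompatible_rts = [455, 470, 448, 462, 440]  # target and response on opposite sides

# The Simon effect is the slowdown on incompatible relative to compatible trials.
simon_effect = mean(incompatible_rts) - mean(compatible_rts)
print(f"Simon effect: {simon_effect:.0f} ms")  # → Simon effect: 42 ms
```

The same subtraction logic applies, with different conditions, to all nine tasks.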

Participants were recruited online and completed each task twice. Some participants completed the task twice with the same materials; others completed the task with different materials each time.

The researchers compared effect sizes to see whether each task’s effect changed from the first to the second completion, and whether reusing the same materials mattered.

The effect sizes turned out to be remarkably stable across repetitions, both when the same materials were used and when different materials were used. This is shown in the figure below, which plots effect sizes for tasks completed by participants the first time (Wave 1) vs. the second time (Wave 2). Each number corresponds to a task from the list above. Tasks completed with the same materials both times are plotted in blue, tasks completed with different materials are plotted in red.
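One common way to standardize a within-participant effect, and so to compare it across waves, is Cohen's d_z: the mean of the per-participant difference scores divided by their standard deviation. The sketch below uses invented difference scores, not the paper's data, and is only meant to show the shape of such a comparison:

```python
from statistics import mean, stdev

def cohens_dz(diffs):
    """Within-participant effect size: mean difference / SD of differences."""
    return mean(diffs) / stdev(diffs)

# Hypothetical per-participant difference scores (e.g., incompatible minus
# compatible RT, in ms) for the same participants in each wave.
wave1 = [35, 42, 28, 50, 38, 45, 31, 40]
wave2 = [33, 45, 30, 47, 40, 43, 29, 41]

print(f"Wave 1 d_z = {cohens_dz(wave1):.2f}")
print(f"Wave 2 d_z = {cohens_dz(wave2):.2f}")
```

If the two printed values are close, the effect size was stable across repetitions, which is the pattern the figure shows for the nine tasks.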

Reliability vs. sensitivity: What are we trying to replicate?

Why were these tasks’ effects so stable? The authors argue that these experimental tasks are so constraining that they shield behavior from outside influences: the environment people are in (participants were tested online, in a variety of settings), task repetition, and the specifics of the task materials.

Whether this is desirable or not depends on what research questions experimental tasks are intended to be used for.

As brought up in a recent Psychonomics featured content post, there are actually different kinds of reliability that researchers may be seeking to maximize with an experimental task.

There is the reliability of observing an effect using a task across individuals and environments, as in the nine tasks described above.

Another desirable feature of a task may be to generate reliable differences in an effect between individuals. In this case, individual participants could be reliably distinguished by their task performance, even across repetitions of the task or across environments.

A final desirable feature may be the ability to reliably identify differences in context or environment using task performance. This is something I care a lot about in my research on the effects of indoor environments on people. For example, to test whether an environment, like an office, helps people sustain attention, I’d be looking for an attention measure that responds reliably to changes in environmental conditions.

Different kinds of reliability and sensitivity are desirable for different research questions, and no single experimental task can do it all. As the field continues its thoughtful discussion of how to move forward from the replicability crisis, considering what we seek to replicate, and why, will help it grow.

Psychonomics article featured in this post:

Zwaan, R. A., Pecher, D., Paolacci, G., Bouwmeester, S., Verkoeijen, P., Dijkstra, K., & Zeelenberg, R. (2017). Participant nonnaiveté and the reproducibility of cognitive psychology. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-017-1348-y
