Reproducible studies may not generate reliable individual differences

The scientific process relies on the ability to replicate findings. This is as true in psychology as in any other discipline. If findings can be reliably replicated, researchers can draw theory-changing conclusions from relatively few data points.

But all is not well: psychology has been grappling with the now-famous “replication crisis.” Recently, a large-scale initiative known as the Reproducibility Project attempted to replicate a number of psychological studies with theoretically important results.

Some of the strongest results, which nearly always replicated, included the Stroop effect.

However, overall, only 39% of the replication attempts were judged to have reproduced the original results. In light of so many failures to replicate what were thought to be foundational results for entire subfields, replicability and open-data movements are gaining steam, and the inherent scientific value of replications has gained renewed prominence.

At the same time that we are interested in the replicability of experimental effects, individual differences also come into play. Individual differences are present in nearly every study – perhaps only some participants are sensitive to an experimental manipulation, either because of intrinsic characteristics (e.g., where they grew up, what languages they speak, etc.) or because of shorter-term factors, like how hungry participants were when they came into the lab.

But even though experimenters cannot control participants’ characteristics, individual differences are still worth understanding, and taking them into account can help us build more robust theories of cognition.

Like experimental effect sizes and directions, measures of individual differences may not always yield identical results. A person might take a working memory assessment on one day, score near the top, and then get a decidedly average score the next day. The consistency of these scores across repeated tests by the same person is known as test-retest reliability.

Reliability can be assessed in a few different ways. Intuitively, if your measure of, say, working memory has test-retest reliability, then the measurement at time point 1 should be correlated with the measurement at time point 2. One common way of measuring this correlation is the intra-class correlation coefficient (ICC).
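In its simplest form, the ICC can be read as the proportion of the total variance in the scores that reflects stable differences between people (this is a conceptual sketch; several ICC variants exist, and the exact formula depends on which one is used):

\[
\mathrm{ICC} \;=\; \frac{\sigma^{2}_{\text{between persons}}}{\sigma^{2}_{\text{between persons}} + \sigma^{2}_{\text{error}}}
\]

An ICC near 1 means that most of the variability in the measurements reflects genuine differences between people rather than session-to-session noise.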

Somewhat counterintuitively, measures that demonstrate high reliability are ones where variability between individuals is also high. If everyone’s performance on the Stroop task takes a hit to exactly the same extent when participants have to remember the last five digits presented to them while performing the task, then all participants are equally sensitive to the manipulation.

Because there is so little variability in this effect, we cannot measure individual differences with it. Low variability can also arise simply because participants are highly similar to one another; in that case, you could almost swap your scores out for mine.

To measure reliable individual differences, we need a task that differentiates people, which is the opposite of what we typically aim for in cognitive science experiments, where we hope to identify cognitive processes that hold for everyone.
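To make this concrete, here is a toy simulation (nothing in it comes from the paper; the effect sizes and noise levels are invented) in which every participant shows a large Stroop-like effect but people barely differ from one another relative to the measurement noise:

```python
# A toy simulation of a robust group effect paired with weak individual differences.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical participants

# Everyone has a large "true" Stroop effect (mean 80 ms), but people differ from
# one another only slightly (SD 10 ms); each session adds measurement noise (SD 30 ms).
true_effect = rng.normal(80, 10, n)
session1 = true_effect + rng.normal(0, 30, n)
session2 = true_effect + rng.normal(0, 30, n)

# The group-level effect "replicates": the mean effect is far from zero both times...
print(session1.mean(), session2.mean())       # both close to 80 ms

# ...but test-retest reliability is poor, because between-person variance is small
# relative to the noise (expected correlation is about 10^2 / (10^2 + 30^2) = 0.1).
print(np.corrcoef(session1, session2)[0, 1])  # well below 1
```

Both simulated sessions show the group effect clearly, yet a person’s score at time 1 tells us very little about their score at time 2.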

The dual problems of experimental replicability and within-person reliability were further demonstrated in a recent study published in the Psychonomic Society journal Behavior Research Methods by Craig Hedge, Georgina Powell, and Petroc Sumner, who examined test-retest reliability across seven common cognitive tasks in three studies.

In Studies 1 and 2, these tasks included such household names as the Stroop task, the Eriksen flanker task, a go/no-go task, and a stop-signal task. In Study 3, the authors looked at test-retest reliability in the Posner cueing task, the Navon task, and a spatial-numerical association of response codes (SNARC) task.

For a brief refresher, here are the tasks used by Hedge and colleagues:

  • The Stroop task involves naming the color of the ink in which a word appears on the screen. Sometimes the word matches its ink color, so participants say “green” when they see “GREEN” printed in green. At other times the word and the ink color conflict, so participants say “green” when they see “RED” printed in green.
  • The Eriksen flanker task is similar, except that participants indicate which direction a central arrow is pointing. The central arrow can be flanked by arrows pointing in the same direction (<<<<<), by neutral symbols (–<–), or by arrows pointing in the opposite direction (>><>>).
  • In the go/no-go task, participants press a button whenever a “go” stimulus appears and withhold their response whenever a “no-go” stimulus appears.
  • The stop-signal task is similar to the go/no-go task, except that participants begin responding on every trial and must cancel their response when a stop signal appears shortly after the go stimulus.
  • In the Posner cueing task, participants decide which of two boxes on the screen contains a target (an X), after a cue such as an arrow points either to the correct location (a valid cue) or to the wrong one (an invalid cue).
  • The SNARC task exploits the tendency to associate the right side with “bigger” and the left side with “smaller,” for example by sometimes requiring responses to small numbers on the right and to large numbers on the left.
  • The Navon task is similar to the Stroop task: a large letter (e.g., an H or an S) is presented that is composed either of the same letter (an S made up of little S’s) or of the other letter (an S made up of little H’s), and participants report either the large (global) letter or the small (local) letters.

These tasks are summarized in the figure below, and all have been used in studies looking at individual differences in abilities like working memory or executive function.

Hedge and colleagues administered each of the tasks and evaluated participants’ performance at two time points three weeks apart. The raw data and summary data are available on their Open Science Framework page.

Each task thus yielded two sets of scores: one from the first observation period and one from the second. Using the ICC described above to test whether participants’ performance in the second session matched their performance in the first, the researchers found that some tasks showed consistently high correlations between the two sessions, that is, high reliability, while other tasks showed only moderate reliability. The figure below shows two measures that behaved consistently across experiments: reaction times in the Stroop task and error rates in the go/no-go task.

In the first, the Stroop RT measure, the points fall close to a 1-to-1 line and are spread out over a wide range of values. The same is true of the go/no-go error rates. This means that each person’s performance at time 1 (each person is represented by a plotting symbol) was roughly mirrored at time 2.

In other cognitive tasks, reliability was low, even though the same participants completed them as completed the high-reliability tasks. Looking at Stroop errors, for example, the correlation between the two sessions was surprisingly weak.

For some other measures, like the global error cost in the Navon task, there was almost no correlation between participants’ performance in the two sessions.
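For readers who want to try this on the raw data from the OSF page mentioned above, here is a minimal sketch of one ICC variant (a one-way random-effects ICC; the variant reported in the paper may differ), applied to made-up two-session scores like those in the earlier toy example:

```python
# A minimal sketch of a one-way random-effects ICC (Shrout & Fleiss ICC(1,1)).
import numpy as np

def icc_oneway(scores: np.ndarray) -> float:
    """ICC(1,1) for an array of shape (n_subjects, n_sessions)."""
    n, k = scores.shape
    subject_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # Between- and within-subject mean squares from a one-way ANOVA.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Made-up example: weak true differences between people (SD 10) plus session
# noise (SD 30) yield a low ICC, even though every simulated participant shows
# a large effect (mean 80).
rng = np.random.default_rng(0)
true_effect = rng.normal(80, 10, 200)
sessions = np.column_stack([true_effect + rng.normal(0, 30, 200),
                            true_effect + rng.normal(0, 30, 200)])
print(icc_oneway(sessions))  # well below 1; the expected value is about 0.1
```

Ready-made ICC functions are also available in statistics packages (for example, the pingouin library for Python), which is how such values would usually be computed in practice.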

The fact that a number of these studies failed to show strong, reproducible individual differences is surprising under one definition of reproducibility. For some experimental paradigms, like the Stroop task, virtually every lab can demonstrate that saying “green” to name the ink when the word says “RED” is difficult. In those studies, individual differences may not come into play at all, because the studies were not designed to differentiate participants based on their performance.

Hedge and colleagues caution that even replicable behavioral tasks may not always yield reliable individual differences. They also point out that minimizing individual differences is, in effect, the goal of many psychology studies: if variability among individuals is low, we can more easily see effects that we hope to call cognitive universals or principles.

The authors name a number of alternatives to the intra-class correlation coefficient for studying individual differences. Increasingly, researchers use mixed-effects models, which take individual differences into account by estimating each participant’s behavior separately while simultaneously estimating how the experimental manipulation affected the group as a whole.
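As a rough sketch of that approach (none of this comes from the paper; the data below are simulated and the variable names are invented), a mixed-effects model of trial-level Stroop reaction times might look like the following, here using the statsmodels package:

```python
# A minimal mixed-effects sketch: a group-level congruency effect plus
# participant-level deviations, fit to simulated trial-level reaction times.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for subject in range(30):
    # Each simulated participant has their own baseline RT and congruency effect.
    baseline = rng.normal(600, 50)
    congruency_effect = rng.normal(80, 20)
    for _ in range(40):
        congruent = int(rng.integers(0, 2))
        rt = baseline - congruency_effect * congruent + rng.normal(0, 60)
        rows.append({"subject": subject, "congruent": congruent, "rt": rt})
df = pd.DataFrame(rows)

# Random intercept and random congruency slope for each participant.
model = smf.mixedlm("rt ~ congruent", data=df, groups=df["subject"],
                    re_formula="~congruent")
result = model.fit()
print(result.summary())
```

The fixed effect of congruent estimates the manipulation’s effect for the average participant, while the random intercepts and slopes capture how much individual participants deviate from that average.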

Alternatively, we can draw on item response theory, in which we consider how difficult each trial is on average, as is done for standardized tests like the SAT or the GRE, before assigning a score to an individual’s performance. Finally, it may be feasible to change some manipulations from between-subjects to within-subjects designs to better account for individual differences.

For now, the right method for studying, or taking into account, individual differences probably depends on the research question.

Despair not, readers! Even though the authors conclude that some of the above tasks are not sufficiently reliable for individual-differences research, the cognitive effects that we are often interested in studying are reproducible. Because everyone shows similar, consistent responses, individual differences are slight relative to the effect of the experimental manipulation. The Stroop task is hard for everyone, and because it is similarly hard for everyone, its effects are consistent and replicable. That same consistency, however, makes it harder to tell who is better at the Stroop task and why. So if your aim is to understand individual differences, you may want to consider other ways of assessing them.

Psychonomic Society article featured in this post:

Hedge, C., Powell, G., & Sumner, P. (2017). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods. DOI: 10.3758/s13428-017-0935-1.

Author

  • Cassandra Jacobs is a graduate student in Psychology at the University of Illinois. Before this, she was a student of linguistics, psychology, and French at the University of Texas, where she worked under Zenzi Griffin and Colin Bannard. Currently she is applying machine learning methods from computer science to understand human language processing under the direction of Gary Dell.



1 Comment

  1. Thanks for this article — nice job. But just for the record I don’t think it is accurate to describe the Reproducibility Project as testing “foundational results for entire subfields.” They tested a higgledy-piggledy assortment of effects from a few very prominent journals. That’s not to disparage the RP nor to take issue with the reality of the “replication crisis,” just to make the point that psychology has many “foundational results” that are rock solid.