“I see a train wreck looming.” Nobel laureate Daniel Kahneman did not mince words in a 2012 email to colleagues in which he drew attention to what he considered a potential replication crisis in at least some areas of psychology. Kahneman’s skepticism was fed by failed attempts to replicate classic priming studies, failures that heightened concerns about replicability in psychology more broadly.
In a recent article in Psychonomic Bulletin &amp; Review, researchers Caren Rotello, Evan Heit, and Chad Dubé take up this issue by noting that “many effects have proven uncomfortably difficult to reproduce.” There can be little doubt that research findings can contribute to the advancement of science only if they are robust and replicable. Conversely, if they fail to replicate, those findings merely create “noise” in the system that can forestall theory development or lead the field, at least temporarily, down a garden path. There are troubling estimates that perhaps as many as 45% of findings fail to replicate: some examples of successes and failures can be found here.
So all we need to do is to ensure that findings replicate, right? Run another study or two, preferably in different labs with different stimuli, experimenters, and participants. And if the effect holds up, then surely it’s ready for the textbooks?
Rotello and colleagues show that things are even more nuanced than that. They argue that even replicable results can be persistently and dramatically misinterpreted.
Another train wreck? A crisis of replicated errors?
Rotello and colleagues present their case by examining examples from a diverse array of topic areas, namely eyewitness memory, deductive reasoning, social psychology, and child welfare.
In all instances, the underlying problem is as simple as it is pernicious: Suppose people are asked to discriminate between two stimuli, call them A and B, by responding with “A” and “B”, respectively. It is mathematically possible that even when people’s ability to discriminate between A and B does not differ between conditions, the pattern of “A” and “B” responses may nonetheless point to the presence of an effect. This can occur when participants’ simple preference for “A” over “B” varies between conditions. Given that even a seemingly irrelevant variable such as room temperature can affect people’s bias for one or the other response alternative, this bias is difficult to control.
Suppose an experimenter observes a striking difference between two conditions: in one, people respond “A” 60% of the time during classification (and thus “B” 40% of the time), and in the other, the proportion of “A” responses rises to 85%. According to Rotello and colleagues, it does not necessarily follow that the independent variable affected discriminability. Further replications of that finding do not help; on the contrary, they may make researchers more and more confident in a finding that in actual fact might merely reflect a shift in people’s preferences for the response alternatives between conditions.
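To make the point concrete, here is a small numerical sketch using the standard equal-variance signal detection model; all numbers are hypothetical and purely illustrative, not data from any of the studies discussed. Both simulated conditions are given exactly the same sensitivity (d′ = 1.0), and only the response criterion, the bias toward answering “A”, differs between them.

```python
from scipy.stats import norm  # standard normal CDF (cdf) and its inverse (ppf)

def predicted_rates(d_prime, criterion):
    """Hit and false-alarm rates implied by sensitivity d' and criterion c."""
    hit = norm.cdf(d_prime / 2 - criterion)
    false_alarm = norm.cdf(-d_prime / 2 - criterion)
    return hit, false_alarm

def percent_correct(hit, false_alarm):
    """Overall accuracy, assuming equal numbers of A and B trials."""
    return (hit + (1 - false_alarm)) / 2

def estimated_d_prime(hit, false_alarm):
    """Bias-free sensitivity recovered from the response rates: d' = z(H) - z(F)."""
    return norm.ppf(hit) - norm.ppf(false_alarm)

# Same sensitivity (d' = 1.0) in both hypothetical conditions; only the bias differs.
for label, criterion in [("neutral criterion", 0.0), ("liberal criterion", -0.8)]:
    h, f = predicted_rates(1.0, criterion)
    print(f"{label}: 'A' responses = {(h + f) / 2:.0%}, "
          f"percent correct = {percent_correct(h, f):.0%}, "
          f"d' = {estimated_d_prime(h, f):.2f}")

# Approximate output:
#   neutral criterion: 'A' responses = 50%, percent correct = 69%, d' = 1.00
#   liberal criterion: 'A' responses = 76%, percent correct = 64%, d' = 1.00
```

The proportion of “A” responses and even percent correct shift with the criterion, yet the sensitivity recovered from the hit and false-alarm rates is identical in both conditions. An analysis of response proportions alone would report an “effect”, and because the bias difference is stable, that effect would replicate.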
Does this theoretical scenario actually play out in reality?
Rotello and colleagues argue that it does.
Perhaps their most striking example involves referrals for child maltreatment—a situation in which a doctor or teacher reports a case of suspected child abuse or neglect. Society clearly has a great interest in ensuring that this process is accurate—we do not want abuse to go undetected but equally, we do not want to accuse parents of abuse or neglect where there is none.
In a recent U.S. study involving more than 1,500 public agencies, more than 11,000 staff, and nearly 12,000 case records, maltreatment referrals were found to be less accurate for Black than for White children when accuracy is measured as percent correct. But percent correct is the very measure that is known to be susceptible to mistaking a bias for an accuracy effect. When the referral data are analyzed another way, using a measure known as d’ that separates bias from accuracy, the analysis either suggests greater accuracy in referrals of Black children, or only slightly greater accuracy for White children.
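To illustrate how such a reversal can arise, here is a minimal sketch with purely hypothetical hit and false-alarm rates (not the study’s actual data), again under the equal-variance signal detection framework: when one group is referred much more liberally, percent correct can favor the more conservatively treated group even though d′ points the other way.

```python
from scipy.stats import norm

def summarize(label, hit, false_alarm):
    """Percent correct, d', and response criterion c for one group (equal base rates)."""
    pc = (hit + (1 - false_alarm)) / 2
    d_prime = norm.ppf(hit) - norm.ppf(false_alarm)           # bias-free sensitivity
    criterion = -(norm.ppf(hit) + norm.ppf(false_alarm)) / 2  # response bias
    print(f"{label}: percent correct = {pc:.1%}, d' = {d_prime:.2f}, c = {criterion:.2f}")

# Hypothetical rates only; the second group is referred far more liberally.
summarize("Group 1 (conservative referrals)", hit=0.72, false_alarm=0.16)
summarize("Group 2 (liberal referrals)",      hit=0.95, false_alarm=0.45)

# Approximate output: Group 1 has the higher percent correct (~78% vs ~75%),
# yet Group 2 has the higher sensitivity (d' ~1.77 vs ~1.58) together with a
# much more liberal criterion (c ~ -0.76 vs ~0.21).
```

In this made-up example, a percent-correct analysis would conclude that referrals for Group 2 are less accurate, when the groups actually differ mainly in how readily referrals are made.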
Rotello and colleagues present four instances in which conventional data analysis, combined with extensive replications, has created “textbook findings” that do not withstand scrutiny when reconsidered from the bias perspective. The authors emphasize that “the misinterpreted results are not generally noisy or unreliable. The basic pattern of data may be quite consistent across studies and labs, … and hence the interpretation of the data is similarly consistent. The interpretive errors are insidious, precisely because the effects are so systematically replicated.”
Replicability is a necessary condition for scientific progress, but it is clearly not sufficient. We must also ensure that our interpretation of the data disentangles bias from accuracy.
Otherwise we run the risk that fixing the replication crisis only creates a replicated-misinterpretation crisis.