Replication and reanalysis of old data are critical to doing good science. We have discussed at various points how to increase the replicability of studies (e.g. here, here, here, and here), and have covered a few meta-analyses (here, here). Maybe it is because technology is constantly changing, or because we forget where we left our files and on which hard drive or flash drive, but it has been relatively rare to see researchers go back and examine their old data.
More recently, however, the Psychonomic Bulletin & Review (PB&R) has been encouraging publications that reexamine research previously published in PB&R. Today we will cover a study just out in PB&R in which the authors looked back on their own initial findings, and critically reassessed a theory. The authors report that the article grew out of the second author’s skepticism, voiced after the first author joined the lab, about the validity of the original results given the analysis behind them.
To remedy the original study’s shortcomings, the authors attempt to rerun the original analysis, working from the first author’s memory and the paper, then present a novel analysis of the original data, along with several additional tests of a computational model using the original data. They also provide a cautionary tale about keeping track of your old analyses and emphasize the importance of open datasets.
Researchers Sean Duffy and John Smith sought to replicate analyses from their original study (Duffy et al., 2010, also published in PB&R), which compared a Bayesian model of categorization to human performance on a spatial judgment task.
The original model, known as the category adjustment model (CAM), is a Bayesian model of categorization judgments. It has been used to explain why individuals’ category representations are often close to the average (e.g. the size, length, loudness, voice onset time, etc. of the stimuli in an experiment). This phenomenon is known as the central tendency bias. CAM, as a model, has been applied to explain a wide variety of tasks. For example, it has been used to explain how the most common way of saying a word influences speech perception, or how the average properties of speech sounds impact word and sound learning. It has also been applied to spatial categories and facial recognition.
CAM accounts for the central tendency bias by stating that participants have an imperfect memory of the stimulus and retain a running average of the stimuli they have seen in addition to remembering how variable the stimuli have been. Below is a simple graphic of the two contributing factors that influence judgments, taken from the original paper by Duffy and colleagues:
As one example of a task that might demonstrate this bias, imagine participants are shown images of cats of different sizes, which disappear from view, and the participants attempt to recreate the size of the last cat from memory by sizing an image up or down. The size of these stimulus cats might come from a symmetrical normal distribution, or it could be skewed towards small sizes, skewed towards large sizes, or the cats might be uniformly varying in size.
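CAM's weighting of an imperfect memory against the running average can be sketched in a few lines. This is a minimal illustration, assuming the textbook Bayesian rule for combining two normally distributed sources of information (each weighted by its precision); the function name cam_estimate, its parameters, and the cat sizes are hypothetical and are not taken from the paper.

```python
def cam_estimate(memory, category_mean, sd_memory, sd_category):
    """Blend a noisy memory with the category mean, weighting each by its precision.

    Assumption: both sources are treated as normal distributions, so the
    weight on the memory is its precision divided by the total precision.
    """
    precision_memory = 1.0 / sd_memory ** 2
    precision_category = 1.0 / sd_category ** 2
    weight_memory = precision_memory / (precision_memory + precision_category)
    return weight_memory * memory + (1.0 - weight_memory) * category_mean


# Hypothetical numbers: a cat remembered as 300 units tall,
# in an experiment where the average cat has been 200 units tall.
precise = cam_estimate(300.0, 200.0, sd_memory=10.0, sd_category=40.0)
fuzzy = cam_estimate(300.0, 200.0, sd_memory=40.0, sd_category=40.0)
print(precise, fuzzy)
```

In this sketch, the fuzzier the memory of the individual cat, the further the estimate is pulled toward the average, which is the central tendency bias in miniature.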
The central tendency bias would predict that participants keep track of the distribution, thus remembering the average size of the cats, as well as how variable the cats have been in size. Because participants are biased toward the mean, their responses are typically closer to the mean than the stimuli actually were. More importantly, judgments should not be biased by recent stimuli, except through the way those stimuli influence the running average and variation. The figure below shows a simple experimental timeline that depicts these different-sized cats over time.
The task of Duffy et al. (2010) and the reanalysis in the latest article by Duffy and Smith involved not pictures of cats but horizontal lines on computer screens.
In their line length judgment task, participants see horizontal lines that disappear, and then must adjust a line to the size they recall of the line that just disappeared. Typically, participants tend to judge lines as being closer to the mean length of the lines they have seen than is initially expected. Duffy and colleagues noted in their original study that participants often judged lines that were short as being longer than they actually were, whereas lines that were long were judged to be shorter than they actually were. This was true no matter whether the majority of lines were short, long, or came from a uniform distribution of line lengths. Below is the graphic that shows error (deviation from the horizontal line) across the three distributions, taken from the original paper:
Maybe you feel skeptical about these results – well, I am sure we have all had discussions with advisors, collaborators, and people who come to the microphone after conference talks who noticed a hitherto hidden flaw or questioned our results, analysis, initial theoretical conclusions, or all of the above. Thankfully, this time that skepticism led to the 2017 Duffy and Smith paper.
Rerunning the original analysis: In the years that have passed since the original study appeared (2010), researchers have gotten better about recording the exact steps of their analyses, which improves reproducibility at the level of analysis. So, the authors attempted to recreate the original analysis of the data from two experiments.
In the original analysis, it appears that the authors modeled the length of the line participants were estimating from memory as a function of the stimulus size, the running mean, and the mean of the preceding 20 trials, without exactly specifying how they calculated that mean. In the original paper, the running mean mattered and the recent trials did not; in the reanalysis, the running mean did not matter and the recent trials did. This appears to be bad news for CAM.
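To make the predictors concrete, here is a rough sketch of how the running mean and the mean of the preceding trials might be computed for each trial. Since the exact calculation is not specified, the choices here are assumptions: both means use only trials before the current one, and trials without a full window of predecessors are skipped. The function name build_predictors and the line lengths are hypothetical.

```python
def build_predictors(targets, window=20):
    """Return one row per usable trial: (target, running_mean, recent_mean).

    Assumptions (not taken from the paper): both means are computed from
    trials *before* the current one, and trials with fewer than `window`
    predecessors are dropped.
    """
    rows = []
    for t in range(window, len(targets)):
        history = targets[:t]                           # every earlier target
        running_mean = sum(history) / len(history)      # mean of all earlier targets
        recent_mean = sum(history[-window:]) / window   # mean of the last `window` targets
        rows.append((targets[t], running_mean, recent_mean))
    return rows


# Hypothetical target lengths, with a short window so the example stays small.
line_lengths = [100, 140, 120, 180, 160, 200]
rows = build_predictors(line_lengths, window=3)
for row in rows:
    print(row)
```

Different answers to those two questions (include the current trial or not, what to do with early trials) give different predictor values, which is one reason an unspecified calculation is hard to reproduce.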
A different analysis using repeated measures regression: The original analysis worked with aggregate data – averages over all of the time points for each of the line sizes, ignoring confounds between the different variables. In a new analysis, the authors used repeated-measures mixed-effects models (which we have covered before here) to account for participants’ responses on each trial as a function of the running mean, the mean of some set of recent targets (e.g. the last 3, 5, 10, 15, and 20 targets), and the target’s actual length.
In this analysis, the authors found even more bad news for CAM: the previous targets continue to influence participants’ responses, and the running mean only matters if models do not include information about previous targets (which are part of the running mean). This highlights the importance of modeling at the individual trial level, because many variables like running averages are correlated with previous observations.
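One way to see why the running mean and the recent targets are entangled: the running mean can always be rewritten as a weighted blend of the mean of the last k targets and the mean of everything before them. The sketch below just verifies that arithmetic; the function name split_running_mean and the numbers are hypothetical.

```python
def split_running_mean(history, k):
    """Split the running mean into its 'recent' and 'older' components.

    Returns (running_mean, recent_mean, older_mean, weight_recent), where
    running_mean == weight_recent * recent_mean
                    + (1 - weight_recent) * older_mean.
    """
    n = len(history)
    if not 0 < k < n:
        raise ValueError("k must be between 1 and len(history) - 1")
    running_mean = sum(history) / n
    recent_mean = sum(history[-k:]) / k        # mean of the last k targets
    older_mean = sum(history[:-k]) / (n - k)   # mean of everything before them
    return running_mean, recent_mean, older_mean, k / n


# Hypothetical target lengths seen so far.
history = [100, 140, 120, 180, 160, 200]
running, recent, older, weight = split_running_mean(history, k=3)
rebuilt = weight * recent + (1 - weight) * older
print(running, recent, older, weight, rebuilt)
```

Because the recent targets sit inside the running mean with weight k/n, a model that includes only the running mean can pick up an effect that really belongs to the recent targets, especially early in the experiment when n is small.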
The authors also took advantage of the computational properties of CAM to further explore their original data. They looked at two previously unconsidered consequences of the model that should be evident in the behavioral data:
(1) As we get more observations of stimuli in an experiment, the variability in our perception of those stimuli goes down. With each observation, we become more and more confident in the mean. Duffy and Smith decided to test whether participants become more sensitive to the mean length of the lines they are judging as the experiment goes on. More bad news for CAM: The authors found that participants do not seem to show greater bias toward the average when that average becomes less uncertain, that is, as the experiment goes on.
(2) Related to the previous question, the authors asked whether participants avoid making responses that are shorter than the minimum or longer than the maximum, as the trials go on. Again, bad news: Participants do not avoid making responses that are impossible after having learned the distribution of the stimuli.
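Both checks can be roughed out by comparing early and late trials. The sketch below is not the authors' analysis: it assumes a simple split into halves, uses the slope of responses on targets as a stand-in for bias (a slope below 1 means responses are pulled toward the mean), and takes the shortest and longest targets as the range of possible stimuli. All names and data are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def out_of_range_rate(responses, shortest, longest):
    """Share of responses shorter than the minimum or longer than the maximum."""
    outside = sum(1 for r in responses if r < shortest or r > longest)
    return outside / len(responses)


def compare_halves(targets, responses):
    """Summarize bias (slope) and out-of-range responses for early vs. late trials."""
    half = len(targets) // 2
    # Assumption: the observed targets span the full stimulus range.
    shortest, longest = min(targets), max(targets)
    summary = {}
    for label, block in (("early", slice(0, half)), ("late", slice(half, None))):
        summary[label] = {
            "slope": ols_slope(targets[block], responses[block]),
            "out_of_range": out_of_range_rate(responses[block], shortest, longest),
        }
    return summary


# Made-up data shaped the way CAM would predict: later responses hug the mean
# more closely (smaller slope) and stay inside the stimulus range.
targets = [100, 200, 300, 400, 100, 200, 300, 400]
responses = [90, 210, 290, 410, 150, 225, 275, 350]
summary = compare_halves(targets, responses)
print(summary)
```

The made-up numbers show the pattern the model would lead us to expect; the point of the reanalysis is that the real responses did not change in this way over the course of the experiment.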
So what can we take away from this latest article by Duffy and Smith, other than that it is always worth looking at your old data with a skeptic’s eye?
Duffy and Smith provide us with a bit of advice. Wherever possible:
(1) Make all of your data available at the individual observation level, and analyze it at that level.
(2) Make your analyses available for re-examination.
(3) If you have variables that can be operationalized in many different ways (like the influence of the previous k trials), test multiple variants of those and report all of them.
(4) Consider whether your model has to have a specific structure, or whether other types of models (e.g., non-Bayesian versions) would make the same predictions.
(5) Strongly consider experimental results that go against your model’s predictions, or that go against an entire class of models.
Altogether, the replication crisis need not be a crisis at all: sometimes looking at old data can provide new insights. Furthermore, everyone benefits from understanding an experiment’s results better through present-day statistical methods, even if the new results go against popular thinking.
Psychonomic Society journal articles featured in this post:
Duffy, S. & Smith, J. (2017). Category effects on stimulus estimation: Shifting and skewed frequency distributions – A reexamination. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-017-1392-7.