Nearly every psychologist has had an experiment come up inconclusive, with statistics that couldn't tell us one way or the other whether our manipulation had worked. These experiments often go straight into the file drawer, to languish forever amongst years-old consent forms. Our distaste for such null results stems in part from the philosophy of null hypothesis testing. When we observe a significant effect, we can reject the null hypothesis and claim that two variables are related or that our manipulation was successful, but when we fail to reject the null hypothesis we have far less confidence in how to interpret that outcome.
Using statistical significance as the criterion for finding effects leaves room for experimental effects that are due simply to chance. With the conventional significance level of .05, 1 in 20 tests (i.e., 5%) is expected to come out "significant" by chance alone even when no real effect exists. Positive results are thought to be over-represented in published studies because of publication bias and researchers' reluctance to publish null results (Ferguson & Heene, 2012; Ioannidis et al., 2014).
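To make the 1-in-20 figure concrete, here is a minimal simulation sketch (in Python, with made-up data): both "conditions" are drawn from the same distribution, so every significant result is a false positive, yet roughly 5% of tests come out significant anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_alarms = 0

for _ in range(n_experiments):
    # Both groups come from the same population, so the null hypothesis is true
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    if stats.ttest_ind(a, b).pvalue < .05:
        false_alarms += 1

# Roughly 5% of these null experiments are "significant" by chance alone
print(f"False positive rate: {false_alarms / n_experiments:.3f}")
```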
The counterpart to statistical significance is statistical power. Whereas significance testing guards against declaring an effect when none exists, statistical power is the probability of detecting an effect when one is truly present. When an experiment is underpowered, our variable of interest may actually have been effective, but we lack the sensitivity to detect its effects.
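As a rough sketch of what "underpowered" means in practice, the snippet below uses statsmodels' analytic power calculation for an independent-samples t-test; the effect size and sample size are made-up illustrative values, not figures from any study discussed here.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# With a smallish true effect (Cohen's d = 0.3) and 20 participants per group,
# the probability of detecting the effect at p < .05 is low
print(analysis.power(effect_size=0.3, nobs1=20, alpha=0.05))

# Participants per group needed to reach the conventional 80% power
print(analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8))
```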
This is the basic dilemma that arises from not finding a significant effect: we cannot say anything concrete if an analysis does not provide significant results. Perhaps with enough observations, or with more subjects, the effect we cared about would have become statistically significant. Or, perhaps this variable really does not affect our dependent measure.
Null results are thus less likely to be published and are banished to the "file drawer". Given this tendency to avoid the ambiguity of null results, it is surprising that some areas of psychology have willingly accepted null effects as evidence for certain cognitive processes; some phenomena are defined by the persistent absence of an effect. One such phenomenon is implicit learning: participants show improved performance that reveals learning, yet they lack any overt knowledge of what they have apparently learned. Implicit learning is a centerpiece of research on visual attention, language, and category learning.
Implicit learning, in which participants are claimed to lack conscious knowledge of an experiment's regularities, is one of the most famous and widely accepted null effects. And because it rests on null effects, it inherits all of the problems surrounding statistical power discussed at the outset.
One way to look at what kinds of studies are being published (both false positives and false negatives) is to conduct a meta-analysis. In a forthcoming issue of Psychonomic Bulletin & Review, Vadillo, Konstantinidis, and Shanks (2015) tackled the problem of interpreting null results as evidence of implicit learning in the contextual cuing paradigm, a popular method for studying attention that exists in many slight variants. Their meta-analysis ultimately included 96 studies using this technique. Importantly, by combining meta-analysis with Bayesian statistics, Vadillo and colleagues were able to examine the distribution of results in the literature and assess how much confidence the field should place in the null hypothesis.
In the contextual cuing paradigm, participants are shown arrays of randomly rotated Ls with a single T embedded amongst them. The T is always rotated 90º, but its stem may face to the right or to the left. Participants are told to find the T as quickly as possible and report its orientation. Crucially, some of the displays are repeated over the course of the experiment. Participants generally become faster at the task with practice, but they are faster still when the display is a repeat. The figure below shows the general speed-up that comes with practice (Panel B), as well as the additional advantage of seeing a repeated display (Panel A).
The repetition benefit is considered unconscious on the basis of a declarative memory test, the gold-standard measure of consciousness and awareness (see Dienes, 2015 for a review). After the training phase, participants are shown the scenes that were repeated in the experiment interspersed with new random scenes, and are asked to state whether they have seen each pattern before. Performance on this task is usually at chance (i.e., people are unable to discriminate the repeated scenes from the new ones), and this null result has been interpreted as a lack of awareness. However, this is exactly the kind of inference that could be undermined if the effect is simply very small, with insufficient statistical power to permit its detection.
One way of testing whether participants are generally performing above chance in this discrimination task is to examine the proportion of studies that report above-chance performance. If participants were truly at chance, a one-tailed test at p < .05 (used when the direction of the effect is predicted) should come out significant in about 5% of studies, and a two-tailed test should come out significantly above chance in only about 2.5% of them, since the other false positives would fall below chance. Among the 96 studies that Vadillo and colleagues analyzed, 21.5% reported performance significantly above chance on the recognition tests, far more than the 2.5–5% that false positives alone could account for. So there are many more studies reporting that participants can discriminate repeated scenes from novel scenes than we would expect by chance.
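To gauge how surprising that 21.5% is, note that under the null hypothesis the number of "above-chance" studies should follow a binomial distribution with a success rate of at most 5%. The sketch below assumes, purely for illustration, that roughly 21 of the 96 studies were positive (an approximation of the reported 21.5%) and computes how improbable that count would be if only false positives were at work.

```python
from scipy import stats

n_studies = 96
k_positive = 21            # ~21.5% of 96; an illustrative approximation
false_positive_rate = .05  # the more generous of the two benchmarks above

# Probability of observing 21 or more positive studies if all were false positives
p = stats.binom.sf(k_positive - 1, n_studies, false_positive_rate)
print(p)   # vanishingly small: chance alone cannot explain this many positive results
```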
Publication bias is the flip side of the file drawer problem: whereas the file drawer problem refers to researchers not submitting studies with null results, publication bias means that null results which are submitted tend not to be published, because editors, authors, and reviewers favor results with significant p values. Vadillo and colleagues recognized that the 21.5% rate of positive results could simply reflect the field's appetite for positive findings. But how would we know whether this represents publication bias rather than a real effect?
Vadillo and colleagues resolved the conundrum by noting that small, statistically weak studies produce highly variable results and are therefore more likely to yield extreme outcomes in either direction. If the studies showing that people can identify repeated scenes involved only a few trials or a small number of subjects, those positive results would be more likely to be false positives; if publication bias were at work, small studies should therefore be the ones most likely to report that people can discriminate repeated contextual cuing displays from new ones. Instead, Vadillo and colleagues found the opposite: the more participants and trials a study included, the more likely it was to find that people can discriminate repeated scenes from new scenes, suggesting a power problem in the studies that failed to find a difference.
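This logic can be illustrated with a small simulation (all values made up): if recognition performance is genuinely just a little above chance, then the larger a study is, the more likely it is to come out significant, which is the pattern Vadillo and colleagues report; publication bias alone would not produce this dependence on sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.25    # assumed small above-chance effect (standardized), made up
n_simulations = 2000

for n in (10, 20, 40, 80, 160):
    detections = 0
    for _ in range(n_simulations):
        # Simulated recognition scores, centred slightly above chance (0 = chance)
        scores = rng.normal(loc=true_effect, scale=1.0, size=n)
        result = stats.ttest_1samp(scores, popmean=0.0)
        detections += (result.pvalue < .05) and (result.statistic > 0)
    print(f"n = {n:3d}: {detections / n_simulations:.0%} of simulated studies detect above-chance performance")
```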
How can we be more confident that the effects we find in our studies reflect real differences? The field has recently become interested in Bayesian statistics (e.g., Kruschke, 2010), which lets us quantify the strength of the evidence for the null hypothesis (that there are no differences between groups) relative to an alternative hypothesis (that there are differences). The effects obtained in our experiments can be translated into Bayes Factors: by convention, a Bayes Factor above 3 counts as substantial evidence for the favored hypothesis, whereas values close to 1 indicate that the data are ambiguous and support neither hypothesis strongly.
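As a concrete (and hedged) illustration, the sketch below implements the default "JZS" Bayes factor for a one-sample t-test, following the integral form in Rouder, Speckman, Sun, Morey, and Iverson (2009, Psychonomic Bulletin & Review), a common choice for this kind of re-analysis; the t values and sample size are invented for illustration, and whether Vadillo and colleagues used exactly this prior is an assumption on my part.

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=0.707):
    """Default (JZS) Bayes factor for a one-sample t-test: evidence for an
    effect (H1) over the null (H0). Values near 1 are ambiguous; values well
    above 3 (or below 1/3) count as substantial evidence."""
    v = n - 1  # degrees of freedom

    # Marginal likelihood kernel of the data under H0 (no effect)
    null_lik = (1 + t**2 / v) ** (-(v + 1) / 2)

    # Under H1, integrate over the Cauchy prior on effect size (scale r)
    def integrand(g):
        if g < 1e-12:          # the integrand vanishes as g -> 0
            return 0.0
        return ((1 + n * g * r**2) ** -0.5
                * (1 + t**2 / ((1 + n * g * r**2) * v)) ** (-(v + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    alt_lik, _ = integrate.quad(integrand, 0, np.inf)
    return alt_lik / null_lik

# A nonsignificant recognition test from a small, made-up sample: the evidence is
# ambiguous (Bayes factor between 1/3 and 3), not positive support for the null
print(jzs_bf10(t=1.5, n=20))

# A clearly significant result at the same sample size gives a Bayes factor well above 3
print(jzs_bf10(t=4.0, n=20))
```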
Vadillo and colleagues converted the results of the 96 contextual cuing experiments into Bayes Factors to see how much support there was for the null hypothesis that people cannot recognize the repeated scenes that gave them a performance advantage. In general, the null results in these studies were associated with only weak, ambiguous evidence, whereas the positive results indicating that repeated and new scenes could be discriminated were generally backed by strong evidence.
What does this mean for the field and the practice of null hypothesis testing? Bayesian statistics offers an alternative way of quantifying what kind of result an experiment has produced, though it has drawbacks of its own. The meta-analysis by Vadillo and colleagues also tells a cautionary tale about the ambiguity of null results, especially when differences between conditions could be reliably detected with only slightly larger samples. When null results form the basis of theoretical arguments, the theory may be presenting an incomplete picture. For contextual cuing, the definition of "implicit" may need to be updated in light of statistical techniques that give experimenters a better understanding of their data.