Trumping Bonferroni to keep your ANOVAs honest

Chemists have test tubes and Bunsen burners. Astronomers have telescopes, computer scientists have computers, and psychologists and cognitive scientists have ANOVAs. If there is one tool that is being used across virtually all domains of psychology, cognitive science, and neuroscience, it is the Analysis of Variance or ANOVA.

Somewhat ironically, ANOVA does not actually test for differences in variance but for differences in means across levels of our treatment variable (the reason it is called ANOVA is interesting but beyond the present scope). In most real-world applications, multiple treatment variables are combined in an experiment, usually in what is called a “fully-crossed” or orthogonal arrangement. For example, we may randomly assign participants to one of 6 conditions formed by the variables cognitive load and distractor task, where the former has 3 levels and the latter 2, and the two are fully crossed. Cognitive load describes the proportion of time in between memoranda in a complex-span task during which attention is captured by the distractor task, and in our hypothetical study cognitive load is low, medium, or high. That is, the time available to perform each distracting activity is either long, medium, or brief. There are two types of distractor task, both involving arithmetic verification statements but in different formats: “4 + 2 = 5?” versus “four + two = five?” Participants respond to each distractor by pressing a yes or no key, and the dependent measure is the proportion of correct recall across all trials within each condition.

You would probably decide within a split second that a 2 × 3 between-subjects ANOVA constitutes an appropriate analysis of this experiment, and you would recognize equally quickly that this analysis returns three effects: two main effects and an interaction. Each would presumably be tested using some criterion for significance, conventionally at the famed .05 level.

So far, so bad.

Yes, there is a problem: The probability that your ANOVA would return at least one significant result even if none of the experimental variables had any effect whatsoever (a so-called Type I error) is around 14%, or nearly three times greater than the 5% postulated by your criterion for significance. This is because each of the three tests has a .95 chance of avoiding a false positive, so if the tests are independent, the chance that all three avoid one is (1 − .05)³ ≈ .86. The familywise error rate is the complement, 1 − (1 − .05)³, which happens to be .14. (Derivations of the equation can be found here.)
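To make the arithmetic concrete, here is a minimal Python sketch of that equation. The function name `familywise_error_rate` is made up for this illustration, and the calculation assumes the tests are independent.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across n_tests independent tests.

    Hypothetical helper for illustration; it assumes every test uses the
    same alpha and that the tests are independent of one another.
    """
    # (1 - alpha) ** n_tests is the chance that no test gives a false positive.
    return 1.0 - (1.0 - alpha) ** n_tests


# 2 x 3 design: two main effects plus one interaction = 3 tests.
fwer_2x3 = familywise_error_rate(0.05, 3)
print(f"{fwer_2x3:.4f}")  # 0.1426
```

With a single test the function simply returns the nominal .05, which is a quick sanity check on the formula.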

Most of us are familiar with the problem of multiple tests of significance, and perhaps because of this familiarity we may not take the problem sufficiently seriously—at least that is what is suggested by a recent article in the Psychonomic Bulletin & Review.

Researchers Angélique Cramer and colleagues examined the issue of multiple hypothesis tests in ANOVAs and showed that the problem is neither trivial nor widely known.

The problem of an inflated Type I error rate is non-trivial because the probability of occurrence increases notably with the addition of further factors into an ANOVA: Add a third, fully-crossed experimental variable to the 2 × 3 scenario from above (e.g., to form a 2 × 3 × 3 design), and the probability of finding at least one significant result when no effects are actually present increases to around 30% [1 − (1 − .05)⁷]. (These examples assume independence between the various effects, which in turn implies that the cell sizes are equal and [very] large.)
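Where does the 7 come from? A fully crossed design with k factors yields one test for every non-empty subset of factors, so 2ᵏ − 1 tests in total. The sketch below enumerates them; the function names and the label "third factor" are placeholders invented for this illustration, and independence between tests is assumed as above.

```python
from itertools import combinations


def anova_effects(factors):
    """List every main effect and interaction of a fully crossed design.

    Hypothetical helper: each non-empty subset of factors is one effect.
    """
    effects = []
    for size in range(1, len(factors) + 1):
        effects.extend(combinations(factors, size))
    return effects


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


two_way = anova_effects(["cognitive load", "distractor task"])
# "third factor" is a placeholder name for the added variable.
three_way = anova_effects(["cognitive load", "distractor task", "third factor"])

for label, effects in [("2 x 3", two_way), ("2 x 3 x 3", three_way)]:
    rate = familywise_error_rate(0.05, len(effects))
    print(f"{label}: {len(effects)} tests, familywise error rate {rate:.3f}")
# 2 x 3: 3 tests, familywise error rate 0.143
# 2 x 3 x 3: 7 tests, familywise error rate 0.302
```

Note that the number of levels per factor does not change the count of tests; only the number of factors does.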

What this means is that if you run an experiment in which you examine memory performance in a complex-span task as a function of three fully-crossed variables, namely (a) the color of the experimenter’s left sock (green vs. blue), (b) the number shown on the lab door (2.13 vs. 2.14), and (c) whether the ambient temperature in the room is 18.2°C or 18.4°C, you will find some significant result in one of three attempts. So all you need is four project students and you can write a paper entitled “socking the complex-span task—but only when you feel warm”. You get the idea.

Cramer and colleagues established the prevalence of this problem by examining all 2010 publications in 6 top journals in the field: Journal of Experimental Psychology: General (N=40 articles); Psychological Science (N=285); Journal of Abnormal Psychology (N=88); Journal of Consulting and Clinical Psychology (N=92); Journal of Experimental Social Psychology (N=178); and Journal of Personality and Social Psychology (N=136).

Of those 819 articles, nearly half (48%) used multiway ANOVAs, similar to the examples from above. However, only 1% of the articles that used a multiway ANOVA applied a correction to guard against the inflated Type I error rate that is inherent in the analysis whenever more than a single treatment variable is involved.

It appears that we have all heard of the “familywise error rate”, and many researchers control it with great diligence during their follow-up tests of a significant interaction, but few realize that the interaction itself may have been the result of a Type I error in a complex design.

Cramer and colleagues provide several remedies for this seemingly widespread problem.

First, and perhaps most important, it must be realized that this problem applies only to the exploratory use of ANOVA, that is, as an analysis technique that is applied to an experiment in order to “see what’s going on,” without any a priori hypotheses. Things are different when a researcher uses a multiway ANOVA for confirmatory purposes; that is, to test one or more hypotheses postulated a priori. For example, suppose we formulate the a priori hypothesis that “4 + 2 = 5?” is processed differently from “four + two = five?” within the 2 × 3 ANOVA introduced above. In that case, the “family” for the familywise error rate no longer encompasses all hypotheses implied by the design (i.e., three), but only the one we specified a priori as being the hypothesis of interest.

But how do we know that a hypothesis was formulated a priori? The researchers would know of course, but what if they didn’t write it down and honestly forgot that the interesting hypothesis about warm blue socks only occurred to them in the shower after the data from that experiment became available?

This brings us to the second, related remedy offered by Cramer and colleagues: Preregistration. If a study and its hypotheses are preregistered—as is common among medical clinical trials; see www.clinicaltrials.gov—then there can be no ambiguity about when the hypothesis was postulated. Preregistration will tell us. And remember, if a hypothesis was formulated a priori, then it does not matter how complex the design is, the test is still conducted with the error rate set at .05, as expected and as desired. You can start preregistering your experiments today, here.

If it is too late to preregister the experiments you are about to analyze, Cramer and colleagues offer several other remedies that correct the inflated Type I error rate by statistical means. Briefly, perhaps the simplest solution is to use the Bonferroni adjustment, which is to divide the desired significance level by the number of tests. Thus, instead of using .05, one would use .05/3 = .01667 for the 2 × 3 ANOVA example from above. Any effect whose p-value is above .01667, even if it is below .05, would not be considered significant. Cramer and colleagues discuss several interesting extensions of this basic approach.
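As a rough sketch of the basic Bonferroni adjustment described above, the Python snippet below applies the .05/3 threshold to the three effects of the 2 × 3 example. The function name `bonferroni` and the p-values are invented purely for illustration; they are not results from any study.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the basic Bonferroni adjustment to a family of tests.

    Hypothetical helper: p_values maps an effect name to its p-value.
    Returns the adjusted threshold and a significance decision per effect.
    """
    threshold = alpha / len(p_values)
    # An effect counts as significant only if p does not exceed the threshold.
    decisions = {effect: p <= threshold for effect, p in p_values.items()}
    return threshold, decisions


# Made-up p-values for the 2 x 3 example; not real data.
hypothetical_p = {
    "cognitive load": 0.004,
    "distractor task": 0.030,
    "load x distractor": 0.200,
}

threshold, decisions = bonferroni(hypothetical_p)
print(f"threshold = {threshold:.5f}")  # threshold = 0.01667
for effect, significant in decisions.items():
    print(f"{effect}: {'significant' if significant else 'not significant'}")
```

In this made-up case the distractor-task effect (p = .030) would pass the conventional .05 criterion but not the adjusted one, which is exactly the situation the adjustment is meant to catch.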

Given that 99% of recent ANOVAs did not use any of those corrections, there is considerable opportunity for the work of Cramer and colleagues to have an impact.

Reference for the paper discussed in this post:

Cramer, A., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R., Waldorp, L., & Wagenmakers, E.-J. (2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review (DOI: 10.3758/s13423-015-0913-5).
