We all want more power. Statistical power, that is: the ability to detect an effect in the presence of noise, even if the effect is small. But is more power always better?

Power seems to be something we should all strive for, just like replicability.

The reality is, however, a bit more nuanced: A few weeks ago I wrote about the problems that can arise even with replicable findings if they rely on a problematic dependent measure. It turns out that statistical power, likewise, comes with its own intriguing nuances that are worth exploring.

A recent article in *Behavior Research Methods* by E.J. Wagenmakers and numerous colleagues tackles the power issue and comes to the conclusion that although “power is useful in planning an experiment, it is less useful—and sometimes even misleading—for making inferences from observed data.”

Why? And how?

On the classical frequentist view, power is defined as an experiment’s ability to detect an effect when it is present. In other words, it is the probability of avoiding a Type II error. Because this probability depends on the size of the effect as well as on the sample size, it is somewhat challenging to determine power ahead of time. The Type I error rate, by contrast, can be set simply by choosing a suitable level for “alpha” (usually .05), which does not vary with sample size or effect size.

For that reason, power is usually examined *before* an experiment is conducted, and it is used to determine the sample size needed to detect an effect of a certain presumed magnitude with some desirable probability. If your power exceeds .8, you are probably doing pretty well.
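To make the dependence on effect size and sample size concrete, here is a minimal sketch of an a priori power calculation. It assumes the simplest possible case, a one-sided, one-sample z-test with known unit variance; the function names and the effect size of 0.5 are illustrative choices of mine, not anything taken from the article.

```python
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test (known unit variance).

    effect_size is the true mean in standard-deviation units (assumed).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    noncentrality = effect_size * n ** 0.5   # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - noncentrality)


def sample_size_for_power(effect_size, target_power=0.8, alpha=0.05):
    """Smallest n whose power reaches target_power (simple linear search)."""
    n = 1
    while power_one_sided_z(effect_size, n, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    for n in (10, 25, 50):
        print(n, round(power_one_sided_z(0.5, n), 3))
    print("n needed for power .8:", sample_size_for_power(0.5))
```

Note that alpha enters only through the rejection threshold, whereas power moves with both the presumed effect size and n. That is why power has to be planned for rather than simply chosen.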

But of course power can also be used to interpret an experiment *after* the data have been collected. For example, if a statistical test is non-significant, but the power of the experiment is high, then researchers may be tempted to interpret that outcome as strong evidence *for* the null hypothesis. After all, if I have a powerful experiment that would have detected an effect if it had been present, then surely the absence of a significant result implies that the effect is also absent? Wagenmakers and colleagues show that this inference can be problematic.

The key to their argument is the distinction between the overall distribution of p-values expected under the null and alternative hypotheses on the one hand, and the implications of a single, actually observed p-value on the other. The two notions are illustrated in the figure below.

Each panel shows the distribution of p-values expected under the null hypothesis (H0; blue line) and under the alternative hypothesis (H1; red line). Several aspects of the figure are noteworthy: First, under H1 the probability of encountering a small p-value is higher than under H0. This simply reflects the fact that when an effect is present, experiments tend to pick it up and yield significant results. When no effect is present, the p-value is uniformly distributed: every value between 0 and 1 is equally likely, so a p of .05 or less turns up only 5% of the time. (As an aside, if you find this surprising, a good explanation and pointers to teaching resources can be found here. Remember that we are talking about the distribution of *p*, not *t*, or *z*, or *F* or whatever other test statistic may have been used to generate the *p*-value.)
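If the uniform distribution of p under H0 seems counterintuitive, a quick simulation may help. The sketch below assumes a one-sided z-test whose test statistic is normal with unit variance; the noncentrality of 2.5 used for H1 is an arbitrary illustrative value (roughly 80% power at alpha = .05), not a number from the figure.

```python
import random
from statistics import NormalDist


def simulate_p_values(noncentrality, n_sims=20000, seed=1):
    """One-sided p-values from simulated z statistics.

    noncentrality = 0 corresponds to H0; any positive value to an assumed H1.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    return [1 - std_normal.cdf(rng.gauss(noncentrality, 1.0))
            for _ in range(n_sims)]


def fraction_below(p_values, threshold):
    """Share of simulated p-values that fall below threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


if __name__ == "__main__":
    p_h0 = simulate_p_values(0.0)
    p_h1 = simulate_p_values(2.5)
    for cutoff in (0.05, 0.25, 0.50, 0.75):
        print(cutoff,
              round(fraction_below(p_h0, cutoff), 3),
              round(fraction_below(p_h1, cutoff), 3))
```

Under H0 the share of p-values below any cutoff should sit close to the cutoff itself, which is what a uniform distribution means. Under H1 the p-values pile up near zero, and the share below .05 approximates the power of the design.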

Now consider the differences between panels: Which do you think shows the situation with greater statistical power? If you guessed B, then your guess is correct: The difference in the area between the two curves is far greater in panel B than in panel A, and small p-values under H1 are overall more likely in panel B. Thus, given that an effect is present, it is more likely to be detected in panel B than panel A.

But here is the third, and most intriguing, feature of the figure: suppose we have conducted an experiment and we have observed a particular single p-value (shown by the vertical dashed line in both panels). If we now use that p-value to draw inferences about likelihoods, then what matters is not the overall area under the two curves, but the ratio between the values of the ordinate—i.e., the likelihood ratio—at the observed p-value. What do those ratios tell us?

In the low-powered experiment (panel A) the observed p is 9.3 times more likely under H1 than under H0—good evidence, in other words, to reject H0. Thus, despite having low power a priori, the significant result from this experiment was quite informative.

In the high-powered experiment (panel B), by contrast, the p is only 1.83 times more likely under H1 than under H0. Despite being a powerful experiment, the significant result here was quite *uninformative*.

The problem does not end there: Suppose the p-value in panel B had escaped significance. Would the high power give us confidence to infer that the null hypothesis is true? No. A non-significant p (say, p=.12) would tell us relatively little about the null hypothesis.
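The ratio of the two curves at one observed p-value can be written down in closed form for a simple case. The sketch below assumes a one-sided z-test with a point alternative, where the density of p under H1 relative to H0 is exp(δz − δ²/2), with z the statistic implied by the observed p and δ the noncentrality. This is my own simplified setup; it is not the model behind the figure and will not reproduce the 9.3 and 1.83 quoted above.

```python
from math import exp
from statistics import NormalDist


def p_value_likelihood_ratio(p_observed, noncentrality):
    """Density of the observed p under H1 divided by its density under H0.

    Assumes a one-sided z-test with unit variance and a point alternative
    whose z statistic has mean `noncentrality` (an illustrative assumption).
    Under H0 the density of p is 1 everywhere, so the ratio equals the
    density under H1.
    """
    z_obs = NormalDist().inv_cdf(1 - p_observed)  # z statistic implied by p
    return exp(noncentrality * z_obs - noncentrality ** 2 / 2)


if __name__ == "__main__":
    # Hypothetical designs: noncentrality 1.5 (modest power),
    # 2.5 (about 80% power) and 4.0 (very high power).
    for delta in (1.5, 2.5, 4.0):
        for p in (0.04, 0.12):
            print(delta, p, round(p_value_likelihood_ratio(p, delta), 3))
```

With these made-up settings, a non-significant p of .12 in the design with roughly 80% power gives a ratio close to 1, so it barely discriminates between the hypotheses. A just-significant p of .04 in the very high-powered design gives a ratio below 1, meaning it is actually more likely under H0. The evidence depends on where the observed p falls relative to both curves, not on power alone.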

The preceding statements about informativeness were based on hypothetical scenarios arising from presumed underlying distributions. They were thus illustrative only, and it is reasonable to ask whether they can be translated into practical inferences, beyond merely waving a yellow flag about the interpretation of powerful experiments.

The answer is yes. To draw powerful inferences without being distracted by the power (or lack thereof) of an experiment, all we need to do is to abandon the classic frequentist approach in favour of Bayesian statistics.
