Confidence intervals? More like confusion intervals

In his influential book, Understanding the New Statistics, Geoff Cumming makes the case that psychologists should change the way they report their statistics. Psychologists, he argues, would be far better off if they stopped reporting p-values and started reporting confidence intervals. When I read his book, I was struck by the evidence he presents on p-value misconceptions. It comes from a survey in which people were given a scenario where a researcher conducts an experiment and the statistical test returns a p-value of .01.

Six possible interpretations of the p-value are given and respondents are asked which they endorse. Of the academic psychologists surveyed, 97% endorsed at least 1 interpretation, as did 80% of methodology instructors in a replication of this survey. All six interpretations should have been recognized as false, and these results are rightly understood as evidence that researchers do not understand p-values.

Cumming uses this evidence to argue that we should instead report confidence intervals because they are easier to grasp and are more intuitive. Indeed, he says, “I suspect the new statistics may license us to think about our results in ways we’ll recognize as natural, and perhaps the ways we’ve secretly been thinking about them all along,” and “[a]dopting the new statistics may not feel like shifting to a different world, but as a release from restrictions” (Cumming, p. 26). Unfortunately, the new statistics are just as restricted as the old statistics.

Misinterpreting confidence intervals

In a more recent survey published in Psychonomic Bulletin & Review, researchers Rink Hoekstra and colleagues put confidence intervals to the test with a survey analogous to the earlier p-value surveys. Hoekstra and colleagues surveyed 442 first-year statistics students, 34 master’s students, and 118 psychology researchers (all in the Netherlands), and posed the following question: “Professor Bumbledorf conducts an experiment, analyzes the data, and reports: ‘The 95% confidence interval for the mean ranges from .1 to .4!’” Respondents were asked whether they would endorse any of 6 statements, and they were told that “all, several, or none of the statements may be correct.” I have reproduced their Table 1 below, which contains the statements and a summary of their results. As Stephan Lewandowsky pointed out in his introduction to this digital event, all of these statements are false.

On average, respondents endorsed about 3.5 statements. For illustration, I will just focus on one of the statements from the survey, statement four: “There is a 95% probability that the true mean lies between 0.1 and 0.4.”

When I discuss confidence intervals with my colleagues, the fact that this kind of statement is false is often one of the hardest to accept. If the probability that any interval selected at random contains the true mean equals 95%, surely, they say, there must be a 95% probability that my interval ranging from .1 to .4 contains the true mean.

The Fundamental Confidence Fallacy

In the recent paper “The Fallacy of Placing Confidence in Confidence Intervals” that stimulated this digital event, researchers Richard Morey and colleagues call this the Fundamental Confidence Fallacy. The Fundamental Confidence Fallacy is fundamental in its most literal sense—it has been prevalent since the very introduction of confidence interval theory by Jerzy Neyman 80 years ago. In fact, as Dave Giles points out, when Neyman explained his theory to the U.S. Department of Agriculture in 1937, there was immediate confusion; future-Nobel-Laureate Milton Friedman interjected to Neyman:

“Your statement of probability that he will be correct in [95] per cent of the cases is also equivalent to the statement, is it not, that the probability is [95] out of 100 that [the true mean] lies between the limits [.1] and [.4]?” (I’ve changed the numbers to reflect the above example.)

Even the brightest minds can get turned around when trying to interpret confidence intervals! Perhaps we should not be too hard on the survey respondents. Neyman was quick to correct Friedman, in saying that, “[the true mean] is not a random variable. It is an unknown constant. In consequence, if you consider the probability of [the true mean] falling within any limits, this may be either zero or [one], according to whether the actual value of [the true mean] happens to be outside of those limits or within.”

In other words, the procedure by which our confidence interval was generated can have a probability attached to it, but a specific interval cannot.

In classical (i.e., frequentist) statistics, all probabilities are defined in terms of long-run frequencies. A procedure that generates confidence intervals can be said to have long-run properties, but when one tries to assign probabilities to single events the sense of the long-run is lost. von Mises put it best when he said: “Our probability theory [frequentist statistics] has nothing to do with questions such as: ‘Is there a probability of Germany being at some time in the future involved in a war with Liberia?’ ” (p. 9).
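The difference between a procedure's long-run property and a single realized interval can be seen in a small simulation. The following is only a sketch under assumptions that are not in the survey scenario: a normal population with a known standard deviation, a true mean of 0.25, and samples of size 30, all chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, level=0.95):
    """Known-sigma confidence interval for the mean of a normal population."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    m = mean(sample)
    return m - half_width, m + half_width


def coverage(true_mean, sigma, n, repetitions, seed=1):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        hits += low <= true_mean <= high
    return hits / repetitions


# Hypothetical values, chosen only for illustration.
TRUE_MEAN, SIGMA, N = 0.25, 1.0, 30

# Long-run property of the procedure: close to 0.95 over many repetitions.
long_run = coverage(TRUE_MEAN, SIGMA, N, repetitions=20000)
print(f"long-run coverage: {long_run:.3f}")

# One realized interval: it either contains the true mean or it does not.
rng = random.Random(7)
one_sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
low, high = z_interval(one_sample, SIGMA)
contains_truth = low <= TRUE_MEAN <= high
print(f"this interval: ({low:.3f}, {high:.3f}); contains true mean: {contains_truth}")
```

In the simulation we can check the single interval because we set the true mean ourselves. In a real experiment the answer is just as fixed, only unknown, which is Neyman's point.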

Another example: If I were to ask you what the probability is that the 100 billionth digit of pi is 2, the natural answer would be 10% since you have no reason to favor one number being in that digit over any other. But this kind of statement is impossible to answer in the classical statistical framework; it is either a 2 or it is not a 2.

In their paper, Morey and colleagues give a very simple example that shows why taking the pre-data probability associated with a confidence interval procedure and applying it to a particular interval derived from that procedure in a post-data inference leads to pathologies in reasoning: “Consider the problem of estimating the mean of a continuous population with two independent observations, y1 and y2. If y1 > y2, we construct an [sic] confidence interval that contains all real numbers (−∞, ∞); otherwise, we construct an empty confidence interval. The first interval is guaranteed to include the true value; the second is guaranteed not to. It is obvious that before observing the data, there is a 50% probability that any sampled interval will contain the true mean. After observing the data, however, we know definitively whether the interval contains the true value. Applying the pre-data probability of 50% to the post-data situation, where we know for certain whether the interval contains the true value, would represent a basic reasoning failure” (p. 3).
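The quoted procedure is simple enough to simulate. This sketch assumes a normal population with a true mean of 0.25, a choice made here for illustration; Morey and colleagues only require a continuous population. The empty interval is represented as None.

```python
import math
import random


def trivial_interval(y1, y2):
    """The whole real line if y1 > y2, otherwise the empty interval (None)."""
    if y1 > y2:
        return (-math.inf, math.inf)
    return None


def contains(interval, value):
    """True if value lies inside the interval; the empty interval contains nothing."""
    return interval is not None and interval[0] < value < interval[1]


rng = random.Random(0)
true_mean = 0.25  # hypothetical value, used only to score the intervals

results = []
for _ in range(10000):
    y1 = rng.gauss(true_mean, 1.0)
    y2 = rng.gauss(true_mean, 1.0)
    results.append(contains(trivial_interval(y1, y2), true_mean))

# Before the data: about half of all intervals contain the true mean.
pre_data_coverage = sum(results) / len(results)
print(f"pre-data coverage: {pre_data_coverage:.3f}")

# After the data: each interval is a sure hit or a sure miss,
# and we can tell which without knowing the true mean.
print(contains(trivial_interval(2.0, 1.0), true_mean))  # whole real line
print(contains(trivial_interval(1.0, 2.0), true_mean))  # empty interval
```

The procedure has 50% coverage, yet no single interval it produces deserves a 50% probability.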

Of course, the above example is purposefully artificial. It makes little sense to assign 50% probability to an empty interval because we know for a fact that the true value is not contained in it (because no values are contained in it). However, similar trouble arises for researchers in the real world when one calculates a confidence interval for the omega-squared effect size (i.e. the proportion of variance accounted for in an ANOVA design). In constructing confidence intervals for omega-squared it is possible for a 95% CI to contain negative values; this is a serious problem because omega-squared (like R-squared) is bounded from zero to one! If one is to interpret 95% confidence intervals as indicating where the true value lies with 95% probability, it would be silly to assign probability to values in the interval that are known to be impossible.

The Plausibility Fallacy

Moreover, the above example also exemplifies the Plausibility Fallacy. (This fallacy exists in several varieties, sometimes involving the plausibility, credibility, or reasonableness of beliefs about the parameter, as explained in a Psychonomics poster by Morey and colleagues.)

If you have read Cumming’s book, then you’ve seen this fallacy in action because it is given as interpretation 2 of confidence intervals: “First, a CI indicates a range of values that, given the data, are plausible [emphasis original] for the population parameter being estimated. Values outside the interval are relatively implausible. Any value in the interval could reasonably be the true value [emphasis added]” (p. 22). How can this interpretation be valid if it is possible for a confidence interval to contain values that we know are by definition impossible? To make matters worse, modern statistical software simply truncates the lower limit of the interval at 0, which hides the negative values from sight and leaves researchers unaware of this problem. And this is a very common problem; it happens whenever the p-value is greater than α/2 (i.e., whenever p > .025 for a 95% CI). If you read the same literature I do, you know that p-values above .025 are quite common!

Where do people learn this stuff?

Recall that nearly 60% of the researchers and 50% of the master’s students surveyed by Hoekstra and colleagues endorsed statements embodying the Fundamental Confidence Fallacy (statements 4 and 5 in the table above). Where does such widespread confusion come from? In discussing their survey, Hoekstra and colleagues speculated that such incorrect interpretations are likely taught in introductory textbooks. Hoekstra, Morey, and Wagenmakers followed up on this speculation at this year’s meeting of the Psychonomic Society, where they presented an analysis of popular introductory statistics textbooks with respect to CI fallacies. The results were grim: They found that 74% (17 out of 23) of the textbooks surveyed included definitions of confidence intervals analogous to the Fundamental Confidence Fallacy, while 26% (6 out of 23) included statements analogous to the Plausibility Fallacy. It is not surprising that researchers misinterpret CIs if they are taught the fallacies from the beginning of their education.

Not so fast?

Are we entitled to conclude, then, that researchers and students in psychology have no reliable knowledge about the correct interpretation of confidence intervals? Miller and Ulrich, in a recent rejoinder to the article by Morey and colleagues that triggered this series of posts, argue that the survey data of Hoekstra and colleagues do not substantiate this conclusion and that indeed their article includes misleading suggestions about the correct interpretations of confidence intervals.

Miller and Ulrich are critical of the definition of CIs given by Hoekstra and colleagues (“If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean”) because it does not mention anything about the particular sample. Miller and Ulrich argue that “[A]lthough formally correct,” surely, “[a]ny interpretation of sample data should in some way summarize the information provided by the sample” (p. 3). They propose a modified definition of a confidence interval that is analogous to statement 4 in the table above but elaborated to clarify the role of probability:

“Statement 4’: If the current sample is one of the 95 % of all samples with relatively small values of |M-μ|/s, then μ lies in the interval 0.1–0.4.” (p. 3; M refers to the sample mean).

Unfortunately, this is merely a tautological restatement of the definition of a confidence interval, and thus not very helpful; as Morey and colleagues point out in their reply to Miller and Ulrich, this is equivalent to saying, “If the conditions under which μ would be in the interval hold, then μ lies in the interval; these conditions will hold in 95% of samples” (p. 7).
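The tautology can be checked numerically. The sketch below assumes the usual t-interval for a sample of size 10 and reads "relatively small" as "no larger than the critical value divided by the square root of n". The population values and the sample size are illustrative assumptions, not taken from Miller and Ulrich.

```python
import random
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262


def check_once(rng, mu, sigma, n=10):
    """Return (condition of Statement 4' holds, mu lies in the 95% interval)."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    m, s = mean(sample), stdev(sample)
    k = T_CRIT_DF9 / n ** 0.5
    condition_holds = abs(m - mu) / s <= k
    mu_in_interval = m - k * s <= mu <= m + k * s
    return condition_holds, mu_in_interval


rng = random.Random(2015)
outcomes = [check_once(rng, mu=0.25, sigma=1.0) for _ in range(5000)]

# The two events coincide in every simulated sample ...
always_agree = all(c == i for c, i in outcomes)
# ... and both occur in roughly 95% of samples.
share_in_interval = sum(i for _, i in outcomes) / len(outcomes)
print(always_agree, round(share_in_interval, 3))
```

Because the condition and the conclusion are the same event written two ways, the conditional statement tells us nothing about a given sample beyond the coverage rate of the procedure.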

Miller and Ulrich also propose that the survey respondents might have had a different definition of probability in mind than Hoekstra and colleagues assume, and when this alternative definition is taken into account, some statements should be marked correct.

For example, Miller and Ulrich argue that when asked to make probability statements, “people often implicitly consider the long-run frequencies of the various outcomes of some random process, even when the question refers to a particular outcome” (p. 4). To support their claim, they note that, “we sometimes flip a coin, cover the result, and ask, ‘what is the probability that the coin is heads?’ Everyone answers 50%, indicating that they think of the question in terms of a long-run sequence of many possible flips” (p. 4).

This seems like quite a stretch. I do not follow their reasoning here; clearly there was no reference at all to any long-run sequence in that answer. Alternatively, this kind of answer is entirely commensurate with the argument that people want to interpret probability statements in terms of plausible degree of belief, not long-run frequencies. Indeed, this is precisely the kind of probabilistic interpretation that Bayesians make!

Perhaps probability is simply too controversial a term, and we should stick with “confidence,” as in statement 5, “We can be 95% confident that the true mean lies between 0.1 and 0.4.” In fact, Miller and Ulrich quote three statistical experts as making similar statements when writing their textbooks (but we have already seen that textbooks can hardly be trusted). One such statement, by the famous Bayesian statistician Morris DeGroot no less, was:

“After the values … in the random sample have been observed, the values of A and B can be computed. If these values are A = a and B = b, then the interval (a, b) is called a confidence interval for μ with confidence coefficient 0.95. We can then make the statement that the unknown value of μ lies in the interval (a, b) with confidence 0.95.” (DeGroot, 1989, p. 337, as cited by Miller and Ulrich)

If an expert such as DeGroot can make these kinds of confidence statements, they argue, surely it is reasonable that the students interpret them this way as well. Of course, introductory textbooks often include oversimplified definitions so that their readers are not overwhelmed. However, a later edition of the textbook by DeGroot removed this sentence and issued a corrective:

“[T]he observed interval…is not so easy to interpret…[S]ome people would like to interpret the interval…as meaning that we are 95% confident that μ is between [the observed confidence limits]. Later…we shall show why such an interpretation is not safe in general” (DeGroot and Schervish, 2012, p. 487, as cited by Morey and colleagues).

Concluding comments

Confidence interval theory is not at all intuitive. Statements that assign a probability to individual confidence intervals are not allowed, statements that interpret confidence intervals as a range of plausible values are not allowed, and statements that interpret the width of the confidence interval as an index of precision are not allowed, as Stephan Lewandowsky reported in the preceding post.

The “new” statistics are just as restrictive as the old statistics, which is not surprising considering they are both built around calculating long-run relative frequencies. Clearly there is a desire to make these kinds of interpretative statements. Why don’t we give the people what they want? The Bayesian statistical framework not only allows, but also encourages all of these kinds of interpretations (and more). In the Bayesian framework one must specify the prior distribution for the parameters of interest based on relevant theoretical information the researcher has prior to conducting their research; careful consideration of the prior distribution yields useful probability statements about parameters.
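To make this concrete, here is a minimal sketch of a Bayesian analysis for a mean. It assumes a normal prior and normally distributed data with a known standard deviation, which gives a normal posterior in closed form. Every number in it (the prior, the sample mean, the sample size) is hypothetical and chosen only for illustration; a real analysis would justify the prior from theory.

```python
from statistics import NormalDist


def posterior_for_mean(prior_mean, prior_sd, sample_mean, sigma, n):
    """Posterior for a normal mean: normal prior, known data sd (conjugate case)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical inputs: a prior centered on 0 and 40 observations averaging 0.25.
posterior = posterior_for_mean(prior_mean=0.0, prior_sd=1.0,
                               sample_mean=0.25, sigma=0.5, n=40)

# A direct probability statement about the parameter, given data and prior.
prob_in_range = posterior.cdf(0.4) - posterior.cdf(0.1)

# A central 95% credible interval from the posterior quantiles.
credible = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

print(f"P(0.1 < mean < 0.4 | data) = {prob_in_range:.3f}")
print(f"95% credible interval: ({credible[0]:.3f}, {credible[1]:.3f})")
```

Here a statement like "there is a 95% probability that the true mean lies in this interval" is legitimate, because the probability is conditional on the data and the stated prior. Change the prior and the numbers change with it.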

The Bayesian approach satisfies our natural intuitions about how to talk about probability and inference. Edwards, Lindman and Savage say it best, in the paper that in 1963 first introduced Bayesian inference to the field of psychology: “The Bayesian approach is a common sense approach. It is simply a set of techniques for orderly expression and revision of your opinions with due regard for internal consistency among their various aspects and for the data. Naturally, then, much that Bayesians say about inference from data has been said before by experienced, intuitive, sophisticated empirical scientists and statisticians. In fact, when a Bayesian procedure violates your intuition, reflection is likely to show the procedure to have been incorrectly applied. If classically trained intuitions do have some conflicts, these often prove transient” (p. 195).

References for the main papers mentioned in this post:

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Miller, J., & Ulrich, R. (2015). Interpreting confidence intervals: A comment on Hoekstra et al. (2014). Psychonomic Bulletin & Review. doi: 10.3758/s13423-015-0859-7

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2015). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review. doi: 10.3758/s13423-015-0947-8

Morey, R. D., Hoekstra, R., Rouder, J. N., & Wagenmakers, E.-J. (2015). Continued misinterpretation of confidence intervals: Response to Miller and Ulrich. Psychonomic Bulletin & Review. doi: 10.3758/s13423-015-0955-8
