Ask any chemist and she will tell you that an electron microscope needs to be carefully calibrated. If not, its measurements are not trustworthy enough for research purposes. As psychologists, our laboratories typically do not include electron microscopes, but we do employ various measurement devices. At the most basic level, this includes our statistical “machinery.” Though we might not normally think of it this way, sample means and regression estimates are measurement devices that must be properly calibrated just like microscopes.
Consider the standard lab experiment in which participants are randomly assigned to different treatment groups. Upon finding a significant difference between the group means, the researcher may conclude that a hypothesis is supported. But this researcher is really making an inference—specifically, that the pattern of sample means reflects the pattern in the parameters of interest, the actual population means.
As with the electron microscope, then, the question of calibration arises. How accurately do those sample means estimate the corresponding population means? More importantly, how accurately should they estimate the population means to be trustworthy enough for research purposes?
Our paper tackles this question. We argue that many fields and research questions are simply not quantifiable enough to specify how accurate sample means (or regression estimates) must be in order to be trustworthy.
At first glance, this claim might seem surprising given that standard methods for assessing accuracy, such as confidence intervals, already exist. But what we don’t have is a standard for what counts as a sufficiently narrow confidence interval. Our approach to creating a standard was to build a foil: a method of estimating population means so clearly inappropriate that any researcher would want the estimates they use to outperform it.
To do so, we created something called random least squares. Random least squares also estimates the population means, but when it does so, it begins by scrambling important information. Specifically, it randomizes which means are estimated to be larger than others and by how much.
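To make the scrambling concrete, here is a minimal sketch in Python of a benchmark in this spirit. It is a simplified illustration rather than the exact estimator defined in the paper: a random pattern of relative group means is drawn first, and the data are used only to set the overall scale of that pattern by least squares. The function name and the toy numbers are just for illustration.

```python
import numpy as np

def random_least_squares(sample_means, rng):
    """Illustrative benchmark: a random pattern of group means, scaled to the data.

    Which means come out larger than which, and by how much relative to one
    another, is decided by a random draw; the data only set the overall scale.
    (A simplified sketch, not the exact estimator from the paper.)
    """
    k = len(sample_means)
    direction = rng.standard_normal(k)           # random pattern of group means
    direction /= np.linalg.norm(direction)       # keep only its "shape"
    scale = direction @ sample_means             # least-squares fit of a single scale
    return scale * direction

rng = np.random.default_rng(1)
observed = np.array([0.40, 0.15, -0.05])         # toy sample means for 3 groups
print(random_least_squares(observed, rng))
```

Run it a few times with different seeds and the estimated ordering of the three groups changes, even though the data do not.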
Why is that of interest to psychologists? Because random least squares yields random conclusions about treatment effects. Under random least squares, whether a treatment works, and even in which direction, is largely determined by chance. Clearly, then, we should want our sample means to be more accurate than random least squares. Otherwise, our conclusions about treatment effects would rest on estimates that are further from the truth than nonsense estimates that amount to guesses.
We derive the probability that standard methods (e.g., sample means and regression estimates) will be more accurate than random least squares as a function of sample size, effect size, and number of treatment groups. We call this probability v. Simply put, we argue that a v of at least .5 is necessary to trust sample means. That is, they should be expected to beat random least squares at least half the time.
Beating a guess sounds easy enough, right? Quite counter-intuitively, it isn’t.
For reasons related to a statistical property called shrinkage, making an a priori choice to randomize the relationships among estimates can have advantages when there is too much noise – that is, when sample sizes and/or effect sizes are too small. Our paper shows that one can achieve traditional levels of statistical significance and still lose to random least squares more than half the time. Alarmingly, if the meta-analytic literature on retrospective power analyses is correct, there are several fields for which the median study has a v less than .5.
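A short simulation conveys the flavor of this result. Everything in it is illustrative: it reuses the simplified benchmark sketched above, and the number of groups, the per-group sample size, and the effect size are hypothetical values chosen for the example rather than figures from the paper. The code estimates v (how often ordinary sample means land closer to the true population means than the random benchmark) and also tracks how often sample means win among the subset of simulated experiments that reach p < .05 in a one-way ANOVA.

```python
import numpy as np
from scipy import stats

def estimate_v(k=4, n=15, effect_sd=0.15, sims=20_000, seed=2):
    """Monte Carlo sketch: how often do sample means beat a random benchmark?

    k groups, n observations per group, true group means spread effect_sd
    (in within-group SD units). All settings are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    wins = sig = sig_wins = 0
    for _ in range(sims):
        true_means = rng.normal(0.0, effect_sd, k)      # true population means
        data = rng.normal(true_means, 1.0, (n, k))      # one simulated experiment
        sample_means = data.mean(axis=0)

        # Random-pattern benchmark, as in the sketch above:
        # random shape, data-fitted scale.
        direction = rng.standard_normal(k)
        direction /= np.linalg.norm(direction)
        benchmark = (direction @ sample_means) * direction

        # "More accurate" = smaller squared distance to the true means.
        beat = (np.sum((sample_means - true_means) ** 2)
                < np.sum((benchmark - true_means) ** 2))
        wins += beat

        # The usual one-way ANOVA test for "a significant difference."
        p = stats.f_oneway(*(data[:, j] for j in range(k))).pvalue
        if p < 0.05:
            sig += 1
            sig_wins += beat

    print(f"estimated v: {wins / sims:.2f}")
    print(f"significant experiments: {sig / sims:.2%}")
    if sig:
        print(f"sample means win among significant experiments: {sig_wins / sig:.2%}")

estimate_v()
```

Varying n and effect_sd shows where v crosses one half; the point is that a significant result, on its own, does not guarantee that it has.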
Our results offer a new perspective on replication and interpretability. Many scholars have argued that our discipline would be better served by focusing on estimation accuracy and confidence intervals. But this raises difficult questions: what counts as a sufficiently narrow confidence interval, and what counts as one that is too wide? In this paper we have attempted to define a minimum standard for estimation accuracy that flows from a simple, normative argument.
At the very least, we should beat a random guess.
(This post is a summary of the Clifford T. Morgan Award-winning paper “Comparing the accuracy of experimental estimates to guessing: a new perspective on replication and the ‘crisis of confidence’ in psychology” by Clintin Davis-Stober and Jason Dana, published in Behavior Research Methods.)