The Goldilocks zone of sample size: Getting it just right

Credit: The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Picture Collection

“This chair is too big!” she exclaimed. So she sat in the second chair. “This chair is too big, too!” she whined. So she tried the last and smallest chair. “Ahhh, this chair is just right,” she sighed. But just as she settled down into the chair to rest, it broke into pieces! – Southey

The story of “Goldilocks and the Three Bears” is an allegory for finding the “goldilocks zone,” the place where things are “just right.” The same notion applies to open science, which has been described as just science done right. But what exactly does this mean for how we try to make studies replicable?

Before the “replication crisis,” your peers were unlikely to comment on the size of your sample of participants (often referred to as “N”). Not now. Nowadays we are more aware that significant results from small samples tend to overestimate the actual size of the effect. Consider Goldilocks’s story. Goldilocks might conclude that bears are meaner than humans because they chased her out of their house. But if she met more bears, she might find that they are no meaner than humans. Having only met three bears, she doesn’t have enough data. The same applies to research: Not having enough data can lead researchers to draw incorrect conclusions.
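The overestimation can be made concrete with a small simulation. The sketch below is illustrative only: the true effect (a standardized difference of 0.3), the group sizes, and the function name `mean_significant_estimate` are assumptions chosen for this example, not values from the featured article. It draws many two-group studies, keeps only those that reach significance in the predicted direction, and averages their estimated effects.

```python
import math
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_d, n_per_group, n_sims=20000, alpha=0.05, seed=1):
    """Average estimated effect among simulated studies that reach significance.

    Assumes two groups with unit variance, so the estimated difference in
    means is normally distributed around true_d with standard error sqrt(2 / n).
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)
    # Smallest estimate that counts as significant (two-sided alpha, predicted direction).
    cutoff = NormalDist().inv_cdf(1 - alpha / 2) * se
    estimates = (rng.gauss(true_d, se) for _ in range(n_sims))
    significant = [d for d in estimates if d > cutoff]
    return mean(significant)


if __name__ == "__main__":
    for n in (10, 200):  # hypothetical group sizes
        print(n, round(mean_significant_estimate(0.3, n), 2))
```

Under these assumptions, an estimate from 10 participants per group must exceed roughly 0.88 to be significant, so every significant small study reports close to three times the true effect of 0.3, or more. With 200 per group, the significant estimates average only slightly above the true value.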

*Open science is science just done right. Credit: Imming and Tennant*

This issue has been singled out as a main contributor to the problems with replication. Consequently, a key remedy is to collect large samples. This recommendation provides critical guidance for people working in institutes that aim to help scientists conduct the best research. I speak from experience: at the Open Science Office at the University of Mannheim, I advise scientists on current standards in open science. But should I always recommend collecting large samples?

At this point, you might be thinking that I am about to tell you not to collect large amounts of data. Don’t worry, I’m not. The right sample size depends on what your goals are, at least according to Brent Wilson, Christine Harris, and John Wixted (pictured below). In their recent article “Theoretical false positive psychology,” published in the Psychonomic Society journal Psychonomic Bulletin & Review, they revisit the idea that larger samples are better. They agree that maximizing sample sizes for experiments is a good thing. However, they draw a distinction between measurement-focused research (i.e., precisely measuring an underlying effect size) and theory-focused research (i.e., testing a prediction about a theory). Note that these two types of research are not mutually exclusive; for the sake of simplicity, however, they are discussed as if they were. One can of course imagine a scenario where both are considered in tandem. In any case, the bottom line is that researchers should maximize sample size for the former and use a different optimization strategy for the latter.

*Authors of the featured article. From left to right: Brent Wilson, Christine Harris, and John Wixted.*

But why? If I collect a large sample and get a significant result, that must mean that my theory is correct, right?

The authors explain that when researchers collect large amounts of data, they increase the likelihood of wrongly concluding that a significant effect supports their theory. This is because many factors unrelated to the theory can produce a small effect in the data, and larger samples make such small effects more likely to reach significance. In other words, the theory is false, but the experiment identified a significant effect in the direction that the theory predicts.
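To see how quickly this can happen, consider a minimal sketch of the chance that a tiny effect reaches significance as the sample grows. Everything here is an assumption made for illustration: the effect size of 0.05, the group sizes, the simple two-group z-test with unit variance, and the function name `prob_significant`. It is not the calculation used in the featured article.

```python
import math
from statistics import NormalDist


def prob_significant(d, n_per_group, alpha=0.05):
    """Probability of a significant result in the predicted (positive) direction.

    Assumes a two-group z-test with unit variance: the test statistic is
    normal with mean d * sqrt(n / 2) and standard deviation 1.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)
    return 1 - NormalDist().cdf(z_crit - shift)


if __name__ == "__main__":
    tiny_effect = 0.05  # hypothetical effect caused by factors unrelated to the theory
    for n in (50, 500, 5000, 50000):
        print(n, round(prob_significant(tiny_effect, n), 3))
```

Under these assumptions, the probability of a significant result in the predicted direction is below 5% with 50 participants per group and close to 100% with 50,000, even though the effect has nothing to do with the theory.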

Recall that one recommended solution to improve replicability is to maximize sample size:

Gotz et al. (2021) recently endorsed large-N studies, arguing that the resulting accumulation of small effects will provide an indispensable foundation for cumulative psychological science. – Wilson et al. (2022)

But this argument is debated. Indeed, by running large-N experiments, we do have a more replicable form of science where we can precisely measure the smallest of effects in separate experiments. Take, for instance, brain-wide association studies (BWAS). These studies used vast amounts of neuroimaging data to examine differences in brain structure and function. About the findings, the contributors wrote:

BWAS associations were smaller than previously thought. – Marek et al. (2020)

As sample sizes grew into the thousands replication rates began to improve and effect size inflation decreased. – Wilson et al. (2022)

If a researcher’s interests lie in measurement-focused research, they now have a series of precisely measured effects, each of which they could likely replicate with large enough samples. But in theory-focused research, they may incorrectly assume that one of these effects is theoretically important. Performing large-sample experiments or reanalyzing combined datasets comes with a potential cost: an accumulation of significant but small effects that may be claimed as support for a given theory but are, in reality, theoretical false positives.

Certainly, it is not plausible that every small effect identified in such experiments is theoretically relevant, correct, or interesting. So, as a consequence of running large-N experiments or reanalyzing large data sets, researchers run the risk of prioritizing replicability at the expense of producing and testing good theory, which cannot be good for improving science.

Where does this leave researchers who want to improve the replicability of their research without jeopardizing its theoretical contribution? Well, the solution is of course not to use small sample sizes, since this would take us back to square one (i.e., overestimating the meanness of bears). Instead, a researcher should optimize their sample size in relation to the likelihood that their theory is true. This is because the strength of the evidence for a theory is linked to the size of the sample: collecting too much data can result in detecting small effects that are not theoretically interesting.
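One way to picture this trade-off is a toy Bayesian calculation of how likely a theory is to be true once a significant result in the predicted direction has been observed. The sketch below is not the procedure from the featured paper. The prior of 0.5, the effect of 0.5 when the theory is true, the spurious effect of 0.05 when it is false, and both function names are hypothetical choices made purely for illustration.

```python
import math
from statistics import NormalDist


def prob_significant(d, n_per_group, alpha=0.05):
    """Chance of a significant result in the predicted direction (two-group z-test, unit variance)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)
    return 1 - NormalDist().cdf(z_crit - shift)


def posterior_theory_true(n_per_group, prior=0.5, d_if_true=0.5, d_if_false=0.05):
    """P(theory is true | significant result in predicted direction), via Bayes' rule.

    d_if_false is a small effect from unrelated factors that happens to point
    in the direction the theory predicts (an assumed value).
    """
    hit_if_true = prior * prob_significant(d_if_true, n_per_group)
    hit_if_false = (1 - prior) * prob_significant(d_if_false, n_per_group)
    return hit_if_true / (hit_if_true + hit_if_false)


if __name__ == "__main__":
    for n in (20, 100, 1000, 10000):  # hypothetical participants per group
        print(n, round(posterior_theory_true(n), 3))
```

With these made-up numbers, the probability that the theory is true given a significant result first rises as the sample grows, then falls back toward the prior of 0.5 once the sample is large enough to detect the spurious effect almost every time. The sample size that is “just right” sits in between.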

The take-home message is: If your interest is in precisely measuring an effect, you should use large samples, but if you are interested in providing evidence to support your theory, you should optimize your sample size.

To find out exactly how to do this, read the featured paper and calculate the sample size that is “just right” for your next pre-registration. For now, let’s go back to Goldilocks:

“This [sample size] is too [small]!” she exclaimed. So, she [ran another experiment]. “This [sample size] is too [small], too!” she whined. So, she tried the last and [largest sample size]. “Ahhh, this sample size is just right,” she sighed. But just as she settled down [science] broke into pieces! – adapted from Southey et al. (2017)

Much like Goldilocks, researchers face a challenge in getting the sample size “just right.” Her story is a reminder that we must adopt a more nuanced approach than previously recommended when determining our sample sizes to find the goldilocks zone – lest we move from a replication crisis into a theory crisis in our attempts to improve science.

Featured Psychonomic Society Article:

Wilson, B. M., Harris, C. R., & Wixted, J. T. (2022). Theoretical false positive psychology. Psychonomic Bulletin & Review, 1-25. https://doi.org/10.3758/s13423-022-02098-w

Author

David Morgan

David Philip Morgan is the Open Science Officer and Academic Coordinator for Research Improvement at the University of Mannheim, Germany, working in research support and strategy related to open science. He is also a Post-Doctoral Scientist at the Central Institute of Mental Health, Mannheim, Germany, where his research mainly focuses on the impact of sleep on memory consolidation and the reproducibility of that research.