A case study on the challenges of theory testing in psychology

“But when does lack of ‘simplicity’ in the protective belt of theoretical adjustments reach the point at which the theory must be abandoned?” – Lakatos, 1976

What does it take to falsify a psychological theory? This question sounds straightforward: if you find data that are inconsistent with the theory, you reject the theory. But in the real world, when testing theories, we have to specify how exactly the broad theory ‘grounds out’ in a specific set of data. This is not always easy.

If we say, “Coffee makes you alert,” how do we then measure alertness? There are many possible ways, and each could give a different answer. This is because researchers can have very different ideas about which measure best captures constructs like alertness. If some of these measures happen to show decreased alertness, does this mean we should reject the theory as a whole? Imagine you give people coffee and then test how well they do on a challenging game that requires alertness but also requires very steady hands. Because coffee makes some people jittery, you find that, on average, people do worse at this game after drinking coffee. Does this mean we should reject the broad theory that coffee makes you alert? Many people would say no and point out that the broad theory might be true while the specific measure used to test it in this case is flawed.

Our article, written by Maria Robinson, Jamal Williams, John Wixted, and Tim Brady (pictured below) and published in the Psychonomic Society journal Psychonomic Bulletin & Review, is driven by the core question, “What does it take to falsify a psychological theory?” The paper builds on recent work on theory assessment practices in psychology. Using an accessible ‘case study’ to examine a fundamental idea from the philosophy of science, we demonstrate how the ‘protective belt’ of auxiliary assumptions (like whether your measure of alertness is a ‘good’ one) affects our ability to test theories. Broad, verbal theories are often hard to test rigorously, and many people expect that very well-specified computational models – where parameters of the model appear to map straightforwardly onto constructs like ‘alertness’ – would not suffer from this same issue and would be easy to evaluate. In our paper, we show how this assumption fails and what we can do to avoid it.

Fig 1. Authors from left to right: Maria Robinson, Assistant Professor in Psychology at the University of Warwick; Jamal Williams, Postdoctoral Scholar in Psychology at Yale University; John Wixted, Professor in Psychology at the University of California, San Diego; Timothy Brady, Professor in Psychology at the University of California, San Diego.

We focus on a concrete example of prominent theories – ones well specified as mathematical models – that have been repeatedly tested by sophisticated computational modellers. In particular, we target the theory that people can “remember 3-4 items in working memory”: the classic idea, from the slot model of memory, that when faced with many items to remember, only a few (N, usually 4) of those items can be remembered and all others are completely forgotten. This intuitive model is tested against another popular theory, often called ‘continuous resource’ theory, which posits that some information is always present in visual working memory – even though it may be highly degraded and essentially ‘noise-like.’
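To make the contrast concrete, here is a rough sketch of how the two theories describe memory for a display containing more items than can be ‘held.’ This is our own illustrative code rather than anything from the papers, and values such as `n_slots = 4` and `total_precision` are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def slot_model_memory(n_items, n_slots=4):
    """Discrete-slot view: a random subset of up to n_slots items is stored
    perfectly; every other item leaves no trace at all."""
    stored = rng.choice(n_items, size=min(n_slots, n_items), replace=False)
    return np.isin(np.arange(n_items), stored)  # True = remembered, False = nothing

def resource_model_memory(n_items, total_precision=8.0):
    """Continuous-resource view: every item is stored, but with noise that
    grows as the same resource is spread over more items."""
    precision_per_item = total_precision / n_items
    noise_sd = 1.0 / np.sqrt(precision_per_item)
    return rng.normal(loc=0.0, scale=noise_sd, size=n_items)  # memory error per item

print(slot_model_memory(6))      # e.g., four items remembered perfectly, two lost entirely
print(resource_model_memory(6))  # all six items retained, each with some error
```

Under the slot view an item is either perfectly present or entirely absent; under the resource view every item is present, but with error that grows as more items must be remembered.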

We specifically examine tests of these theories in a widely cited article published in a high-profile outlet (Rouder et al., 2008, Proceedings of the National Academy of Sciences), as well as a replication and extension of that study (Donkin et al., 2014), which took the first steps towards addressing limitations of the original work. We use these articles as case studies because they continue to have a significant influence on how researchers theorize about and measure visual working memory, and are therefore highly important in the study of visual memory. Moreover, both articles involve well-established models being tested by highly advanced computational researchers, which should be the best-case scenario for assessing psychological theories.

In our reanalysis, we find that these papers still fall prey to the considerable difficulty of sorting out which aspects of a theory are core to it and which are merely “auxiliary” assumptions needed to ground the theory in specific data and methods (Fig. 2). We show, through a systematic reanalysis, that when core theoretical and analytic assumptions are checked, the data in these papers are either non-diagnostic or support conclusions opposite to those drawn in the original work.

Fig 2. A schematic showing the range of conceptual and practical auxiliary assumptions made as part of theory assessment in psychology. Conceptual auxiliary assumptions include how to instantiate theories as computational models. These assumptions can be ‘theory-general,’ meaning they apply to both theories – in the context of these studies, an example of a theory-general auxiliary assumption is whether people’s decision bias is stable or varies with changes in memory load. Conceptual auxiliary assumptions can also be ‘theory-specific’ and apply to one theory only – for example, the type of distribution (e.g., Gaussian versus Gumbel) that researchers assume best captures the underlying distribution of memory signals in continuous resource models. Finally, researchers also make practical auxiliary assumptions, such as which tasks to use to measure psychological constructs so that they yield diagnostic data, and how to analyze data in the presence of measurement noise. Each of these auxiliary assumptions must be distinguished from core theoretical assumptions – in this case, whether visual memories are stored continuously or in an all-or-none fashion – before drawing conclusions from empirical data.

In other words, the two theories of visual working memory were tested using different or unchecked auxiliary assumptions, and these ancillary decisions, rather than the merits of slot vs. resource theory per se, led to the specific conclusions in these papers (Fig. 3). Therefore, while the original articles reported support for slot models, our reanalysis indicates that they actually show more support for resource models (Fig. 4).

Fig. 3. (A) An example trial in the popular change detection task, which is commonly used to measure visual working memory. In this example, participants need to remember five colored squares and their spatial locations. After a brief delay, participants must indicate whether the probed item is the same as or different from the item originally presented at that location. (B) A schematic of all-or-none and continuous resource models, together with the shapes of their theoretical Receiver Operating Characteristic (ROC) curves, which are often used to assess computational models in these cognitive tasks. Discrete-slot models assume that memory fails in an all-or-none way and predict a linear ROC. Continuous resource models assume that memory representations are continuous and predict curvilinear ROCs. (C) These models make qualitatively distinct predictions about the shape of the ROC, but, importantly, there is a portion of ROC space where the models can make overlapping predictions (gray shaded region), making data that fall in this region non-diagnostic for discriminating between linear and curvilinear functions. As shown in the aggregate ROC data, experiments that use only a few (e.g., three) base-rate manipulations may generate data that fall within this non-diagnostic region. This issue motivated Donkin et al. (2014) to run experiments with more base-rate conditions (two ROC curves with five instead of three points, shown in the two far-right panels).
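For readers who want to see where the linear versus curvilinear predictions in Fig. 3B come from, the following sketch traces both theoretical ROCs under standard simplifying assumptions. The parameter values (d = 0.6, d′ = 1.5) are arbitrary choices for illustration, not estimates from the papers:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Discrete-slot (all-or-none) prediction: the probed item is in memory with
# probability d; otherwise the participant guesses "change" with probability g.
# Sweeping the guessing rate g traces a straight line in ROC space.
d = 0.6                              # illustrative value, e.g. d = min(K / set size, 1)
g = np.linspace(0, 1, 100)
slot_hits = d + (1 - d) * g
slot_fas = (1 - d) * g

# Continuous-resource prediction (equal-variance signal detection): memory
# strength d' and a response criterion c; sweeping c traces a curved ROC.
dprime = 1.5                         # illustrative memory-strength value
c = np.linspace(-3, 3, 100)
res_hits = norm.cdf(dprime - c)
res_fas = norm.cdf(-c)

plt.plot(slot_fas, slot_hits, label="discrete-slot (linear)")
plt.plot(res_fas, res_hits, label="continuous resource (curvilinear)")
plt.xlabel("False-alarm rate"); plt.ylabel("Hit rate"); plt.legend(); plt.show()
```

Each base-rate condition in an experiment contributes one empirical point along such a curve, which is why only a few conditions can leave the data sitting in the region where the two predicted shapes overlap.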
Fig. 4. (A) Results of a reanalysis in which all-or-none (classic ‘discrete-slot’) models and continuous resource models are matched on their ‘theory-general’ auxiliary assumptions about how model parameters change as a function of memory load. Comparisons are made only between ‘matched’ pairs of all-or-none and continuous resource models; these have the same number of parameters and are compared with the negative log-likelihood (NLL). When models are matched on theory-general auxiliary assumptions, they all fit the data equally well, except in Donkin et al. (2014; Exp 2), where versions of the resource model outperform their slot-model counterparts. (B) Top: Results from comparing the best-fitting resource model to all variants of the discrete-slot model in Experiment 2 of Donkin et al. (2014), where the empirical ROCs are diagnostic for comparing the models because they span a sufficiently wide range of ROC space. In these comparisons, we find consistent evidence for the continuous resource model using the only well-calibrated (unbiased) model comparison metric in this context (the Akaike Information Criterion), as shown by the model recovery simulations (bottom). Circles and stars denote medians.
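The comparison logic behind Fig. 4 can be illustrated with a toy example: fit a matched pair of models to hit and false-alarm counts from several base-rate conditions by maximum likelihood, then compare negative log-likelihoods and AIC = 2k + 2·NLL. The counts below are invented purely for illustration, and these simple two-parameter-family models stand in for the richer model variants fit in the paper:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Hypothetical hit / false-alarm counts from three base-rate conditions
n_trials = 200
hits = np.array([120, 150, 175])
false_alarms = np.array([20, 45, 90])

def binom_nll(hr, fa):
    """Negative log-likelihood of the counts given predicted hit/false-alarm rates."""
    hr = np.clip(hr, 1e-6, 1 - 1e-6)
    fa = np.clip(fa, 1e-6, 1 - 1e-6)
    return -np.sum(hits * np.log(hr) + (n_trials - hits) * np.log(1 - hr)
                   + false_alarms * np.log(fa) + (n_trials - false_alarms) * np.log(1 - fa))

def nll_slot(params):
    """All-or-none model: one memory parameter d shared across conditions,
    one guessing rate g per base-rate condition -> linear ROC."""
    d, g = params[0], params[1:]
    return binom_nll(d + (1 - d) * g, (1 - d) * g)

def nll_resource(params):
    """Equal-variance signal-detection model: one d' shared across conditions,
    one criterion c per condition -> curvilinear ROC."""
    dprime, c = params[0], params[1:]
    return binom_nll(norm.cdf(dprime - c), norm.cdf(-c))

fit_slot = minimize(nll_slot, x0=[0.5, 0.3, 0.5, 0.7], bounds=[(0.01, 0.99)] * 4)
fit_res = minimize(nll_resource, x0=[1.0, 0.5, 0.0, -0.5],
                   bounds=[(0.01, 5)] + [(-3, 3)] * 3)

k = 4  # both 'matched' models here have four free parameters
for name, fit in [("discrete-slot", fit_slot), ("continuous-resource", fit_res)]:
    print(f"{name}: NLL = {fit.fun:.2f}, AIC = {2 * k + 2 * fit.fun:.2f}")
```

Because the matched models have the same number of parameters, the AIC comparison here reduces to the NLL comparison; penalized metrics and the model recovery simulations in Fig. 4B become important once model variants differ in flexibility.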

Lakatos responds to his own question by pointing out that the ‘protective belt’ of auxiliary assumptions is difficult to assess on the basis of parsimony alone. Instead, auxiliary assumptions should be seen as a series of subsidiary theories, which support the core theory and may warrant revision or replacement as part of theoretical development. While Lakatos’s philosophy of falsification was nuanced, we use an accessible and prominent example to outline basic steps towards identifying and testing such subsidiary theories and addressing current challenges in falsification in psychology. Our article integrates and builds on critical contemporary issues in theory assessment practices, and we hope researchers from various sub-disciplines will engage with this work.

Links to related papers

For readers interested in additional contemporary work on theory assessment practices in psychology, we recommend the following articles as a starting point:

Featured Psychonomic Society article

Robinson, M., Williams, J. R., Wixted, J. T., & Brady, T. (2024). Zooming in on what counts as core and auxiliary in theory assessment: A case study on recognition models of visual working memory. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-024-02562-9

Authors

  • Maria Robinson

    Hello, my name is Maria Robinson and I’m an assistant professor in the Behavioral Science group in the Psychology Department at the University of Warwick. I received my PhD at the University of Illinois Urbana-Champaign in 2019 and was an NRSA postdoc in Timothy Brady’s lab at the University of California, San Diego. My primary research interest is in combining formal modeling and analysis with empirical work to study a range of cognitive processes, like attention and memory. I’m also interested in topics on best practices in measurement and theory assessment in psychology.

  • Jamal Williams

    Hi! I'm Jamal and I like to study how we process and maintain information from the visual world. I’m currently working in Kia Nobre’s Brain & Cognition lab as a postdoctoral associate at Yale. My research primarily focuses on how audition influences visual perception and how memories interact with visual attention. I’m also interested in how we store and use visual memories and in how we measure the strength or fidelity of these representations. My work incorporates experimental, computational, and electrophysiological methods to better understand the cognitive and neural mechanisms that give rise to visual perception and memory. Before joining Yale, I completed my PhD at UC San Diego where I was supported by the NSF GRFP and advised by Tim Brady and Viola Störmer. Prior to graduate school, I received my BS in Cognitive and Behavioral Neuroscience from UC San Diego, where I was fortunate to work in multiple labs, including those of Macagño, Vul, Brady, and Störmer.


