A philosopher and a theologian are discussing their respective fields over coffee. The theologian dismisses philosophy: “You know what a philosopher is like?” he demands. “A philosopher is a man searching in a dark room for a black cat that isn’t there.” The philosopher nods. “Maybe so,” he concedes, “but it takes a theologian to find the cat.”
One of the many things that keeps researchers up at night is the “statistical artifact”. A statistical artifact is an illusory result caused by a mistake — sometimes subtle, sometimes not — in the data analysis process. Simpson’s paradox is a famous example of such an artifact: collapsing across an important variable can reverse a relationship and cause you to incorrectly believe, for instance, that smoking reduces mortality. I previously discussed related averaging artifacts in decision research. A result based on a statistical artifact is the black cat in the dark room that isn’t there: you don’t want to be looking for it, and you certainly don’t want to find it.
A new paper in Psychonomic Bulletin & Review by David Shanks shows that many findings of unconscious perception may be the result of a statistical artifact. Unconscious perception is the idea that behaviour is affected by things of which we are not conscious and cannot accurately describe or articulate. Subliminal priming is one example: presenting a brand logo very quickly is thought by some to change behaviour (e.g., to favour that brand a short time later) without conscious awareness that the brand logo was displayed.
As a concrete example, suppose we are interested in the effects that quickly-presented brand logos have on behaviour. We first present brand logos and neutral pictures at very short durations and ask people to categorize them as brand logos or not. If half the pictures are logos and half are not, people will perform at 50% accuracy if they truly cannot see the images. This is called an “awareness” measure.
In a different task, we present the images quickly again (now called “primes”), but directly follow each with a product (called the “target”) that participants have to rate in terms of likeability. The extent to which people like the products more when they follow their corresponding, quickly-presented brand logo is called the “performance” measure: it measures how well the prime “works”. If we can simultaneously show a positive performance measure (the effect of the prime on behaviour) and show that accuracy is at 50% in the awareness measure, we call this subliminal priming.
But the devil is in the details, as they say. We cannot measure awareness or performance directly; they will always be measured with error. What researchers often do is choose a cutoff on the awareness measure and look at the performance measure for participants whose scores show “unawareness” by that criterion. For instance, we might select only participants who scored at or below 50% on the awareness measure. Critically, this 50% includes error: a coin with a 60% chance of showing heads will often show five or fewer heads in 10 flips, and a participant who is somewhat aware of the stimulus might, nonetheless, perform at or below the awareness criterion. And that is the problem.
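To put a rough number on the coin example, here is a minimal sketch that computes the binomial probability directly. The function name `prob_at_most` is just an illustrative helper, and treating a participant's 10 awareness trials as independent coin flips is an assumption made for simplicity.

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# A coin (or participant) with a 60% success rate, observed over 10 trials:
# how often does it land at or below the 50% cutoff (five or fewer successes)?
p_misclassified = prob_at_most(5, 10, 0.6)
print(f"P(5 or fewer heads in 10 flips) = {p_misclassified:.3f}")  # 0.367
```

So under these assumptions a somewhat-aware participant would land at or below the cutoff more than a third of the time.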
Consider the figure above. Hypothetical participants (emojis) are categorized in terms of whether they can see the subliminal prime (heart-eyed emojis) or not (neutral emojis). On the left is their true awareness, and on the right is their measured awareness. Suppose that there is no subliminal priming: participants who cannot see the prime have a true performance score of 0, and those who can see it have a true performance score of, say, 10 (it does not matter what this number represents; only that it is above 0). If we knew the true awareness and performance scores for all participants, we could easily reject subliminal priming.
Because we measure the awareness scores with error, however, some people will inevitably end up in the wrong category. A proportion of unaware people will score above the criterion and be categorized as “aware”, and some proportion of aware people will be categorized as “unaware”. This latter categorization error is the problematic one, because we are only computing the performance scores for those we categorize as “unaware”.
Suppose 10% of our “unaware”-categorized participants are aware. Then the true average performance score of this group will be 0.1 times 10 = 1, which is above 0. Of course, we do not know the true performance scores, so we would normally use a statistical test to establish that the average is greater than 0, but this does not matter. If anyone in the population can see the prime, there is a chance of a categorization error, and hence the population average performance score for the “unaware” category must be greater than 0 (making the statistical test superfluous). The more error there is in the categorization, the more the performance score is inflated.
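The arithmetic here is just a weighted average over the two kinds of participant in the “unaware” category. As a small sketch (the function name `selected_group_mean` and the extra contamination rates are illustrative assumptions, not values from Shanks' paper):

```python
def selected_group_mean(contamination: float,
                        aware_score: float = 10.0,
                        unaware_score: float = 0.0) -> float:
    """True mean performance of the "unaware"-categorized group.

    contamination: proportion of that group who are actually aware.
    """
    return contamination * aware_score + (1 - contamination) * unaware_score


# With no contamination the true mean is 0; any contamination pushes it above 0.
for rate in (0.0, 0.05, 0.1, 0.2):
    print(f"contamination {rate:.2f} -> true mean {selected_group_mean(rate):.2f}")
```

The mean grows in direct proportion to the contamination rate, which is why more categorization error means more inflation.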
The root of the problem is that because we are only interested in the participants we have categorized as “unaware”, we can only make one kind of error: categorizing aware people as “unaware”. If our categorization error rate is anything above 0, we will have illusory support for subliminal priming. We have established a performance score of greater than 0 for the “unaware” group, but this is caused only by the inevitable contamination of this group by aware people.
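A small simulation can make the contamination visible. This is only a sketch under invented assumptions: half of the participants are truly aware, aware participants answer 60% of awareness trials correctly, there are 10 awareness trials, and performance is measured with Gaussian noise. None of these numbers come from Shanks' paper, and there is no subliminal priming in the simulated world.

```python
import random


def simulate(n_participants=20000, n_trials=10, p_aware=0.5,
             acc_aware=0.6, effect_aware=10.0, noise_sd=5.0, seed=1):
    """Return (truly_aware, measured_performance) for participants
    who score at or below 50% on the awareness test."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n_participants):
        aware = rng.random() < p_aware
        accuracy = acc_aware if aware else 0.5  # unaware participants guess
        correct = sum(rng.random() < accuracy for _ in range(n_trials))
        # True performance is 0 for unaware participants: no subliminal priming.
        performance = (effect_aware if aware else 0.0) + rng.gauss(0.0, noise_sd)
        if correct <= n_trials / 2:  # post hoc "unaware" criterion
            selected.append((aware, performance))
    return selected


selected = simulate()
contamination = sum(aware for aware, _ in selected) / len(selected)
mean_all = sum(perf for _, perf in selected) / len(selected)
unaware_scores = [perf for aware, perf in selected if not aware]
mean_unaware = sum(unaware_scores) / len(unaware_scores)

print(f"share of 'unaware' group that is truly aware: {contamination:.2f}")
print(f"mean performance, whole 'unaware' group:      {mean_all:.2f}")
print(f"mean performance, truly unaware only:         {mean_unaware:.2f}")
```

In this setup the truly unaware participants average close to 0, while the selected group as a whole sits well above 0, purely because of the aware participants who slipped under the cutoff.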
Shanks lists dozens of studies of unconscious perception that are affected by this problem. There are likely many more affected studies, though he notes that “[i]t would be impossible to systematically collect all such studies, in part because no consistent name for the method is used and because it is applied in so many different contexts.” Of course, this problem affects much more than just the studies themselves; these studies have been collectively cited by thousands of other scientific papers.
What can those interested in unconscious processing do to avoid this problem? Shanks notes that the issue arises due to the selection. One could largely avoid the problem, for instance, by carefully calibrating the experiment for each participant to ensure that their awareness scores are close to chance, rather than presenting the same prime durations to everyone and selecting participants post hoc. This would be much more work than researchers presently put into such experiments, but Shanks notes that such careful experiments were once the norm.
A second approach that would avoid the artifact is to make use of all the data: one could, for instance, model how the performance changes as a function of the awareness measure across the full range of both. This is a complex modeling endeavour, but it has the advantage that examining performance over a range of awareness has the potential to be more enlightening than simply focusing on the range around null awareness.
Data analysis is a difficult process, filled with pitfalls similar to Shanks’ artifact. Such statistical artifacts make us think we’ve found something, when in fact we have not. Luckily, skeptical papers like Shanks’ represent science’s check on bad methods: they give us a reason to drop the black cat we’re not holding.
Article focused on in this post: