Betrayed by averaging: invalid inferences when nobody is ‘average’

One of the essential goals of psychology is generalization: describing ways in which people are similar. Of course, human behaviour varies across situations, times, and individuals, and hence often defies generalization. Ignoring this variability and assuming that people are the same can lead to improper generalizations about human behaviour. In a new paper in the Psychonomic Bulletin & Review, Shi Xian Liew and colleagues describe instances where previous research in decision making has ignored individual differences and thus made inappropriate inferences.

We begin with an example to demonstrate the critical problem. Suppose Jane is the proprietor of a café, and she wishes to assess customer preference for coffee or tea in preparation for a simplification of her menu. For six weeks, Jane watches her 20 daily customers and records whether they order a coffee or a tea. At the end of the 30-workday period, she tallies up the number of coffees and teas ordered: 300 coffees, and 300 teas. She concludes that customers have no preference either way, and to simplify the menu removes coffee. Immediately, her business drops by half, and she has to close her shop.

Jane’s error is easy to spot: she assumed that the aggregated or average orders represented the individual customers. Obviously, this isn’t right; one customer may have ordered coffee every day because they didn’t like tea. As soon as coffee was removed from the menu, they stopped visiting. The error would have been obvious had Jane looked at individual ordering behaviour: if all people choose tea every day or coffee every day, then the average customer order (50% coffee, 50% tea) represents no one’s preference.
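
The arithmetic behind Jane's mistake can be sketched in a few lines of Python. This is a minimal illustration under a stated assumption that goes beyond Jane's tally: 10 of the 20 customers always order coffee and the other 10 always order tea. The function names are hypothetical and chosen only for this sketch.

```python
# Hypothetical café data: 20 customers, 30 workdays, one drink per visit.
# Assumption (not part of Jane's tally): 10 customers always order coffee
# and 10 always order tea.
N_CUSTOMERS = 20
N_DAYS = 30

def build_orders():
    """Return one list of daily orders per customer."""
    orders = []
    for customer in range(N_CUSTOMERS):
        drink = "coffee" if customer < N_CUSTOMERS // 2 else "tea"
        orders.append([drink] * N_DAYS)
    return orders

def aggregate_coffee_share(orders):
    """Share of coffee across all orders pooled together (Jane's tally)."""
    all_orders = [drink for customer in orders for drink in customer]
    return all_orders.count("coffee") / len(all_orders)

def individual_coffee_shares(orders):
    """Share of coffee computed separately for each customer."""
    return [customer.count("coffee") / len(customer) for customer in orders]

orders = build_orders()
pooled = aggregate_coffee_share(orders)
per_customer = individual_coffee_shares(orders)

# Count customers whose own coffee share is anywhere near the pooled share.
n_near_average = sum(1 for share in per_customer if abs(share - pooled) < 0.25)

print(f"pooled coffee share: {pooled:.2f}")
print(f"distinct individual shares: {sorted(set(per_customer))}")
print(f"customers within 0.25 of the pooled share: {n_near_average}")
```

Under this assumption the pooled share is exactly one half, yet every individual share is either 0 or 1, so no customer sits anywhere near the "average" customer.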

As obvious as this seems when put in terms of coffee and tea ordering, cognitive psychologists have been slow to realize the problems of averaging. This is possibly because a sizeable proportion of our statistical training uses the average to represent a group. This is often fine when the question of interest is about that average (e.g., does a training program increase reading performance, on average?), but when the question is about psychological process, averaging can lead to very misleading results because averages don’t have psychological processes. Only individual people do. An average may look very different from any or all individuals.

Problematic averaging in data analysis

In decision-making, researchers are keenly interested in how the options that are presented in a decision scenario affect that decision. For instance, suppose you were ambivalent between ordering coffee or tea. Would the knowledge that you could order soda change your ambivalence about coffee or tea? The availability of soda doesn’t change the coffee or tea in any way; traditional accounts of decision making would predict that people would be equally ambivalent, regardless of the availability of a third option.

However, human choice behaviour exhibits a number of so-called “context effects”: that is, the context in which two options are presented, such as including a third option, can affect how people regard the first two options. Following Trueblood (2012), Liew and colleagues are interested in three effects:

Similarity effect: Introducing a third option (“decoy”) that is similar to one of the two other choices, but just as desirable, will cause people to tend to choose the less similar of the non-decoy options. Example: If you are ambivalent about choosing between a mobile phone that is small but expensive and one that is larger but cheap, a similarity effect occurs when introducing an even cheaper, larger phone would tend to cause people to choose the small phone.
Attraction effect: Introducing a decoy option that is similar, but inferior, to one of two original options will cause people to tend to prefer the one of the two options that is similar to the decoy. Example: If you are ambivalent about choosing between a mobile phone that is small but expensive and one that is larger but cheap, an attraction effect occurs when introducing a small but even more expensive phone would tend to cause people to choose the small phone.
Compromise effect: Introducing a decoy option that is extreme on both of two dimensions, but equally desirable, will cause people to tend to prefer the “compromise option” that is closer to the decoy on both dimensions. This differs from the attraction effect because the decoy is not inferior to the other choices. Example: If you are ambivalent about choosing between a mobile phone that is small but expensive and one that is larger but cheap, a compromise effect occurs when introducing a very small and much more expensive phone would tend to cause people to choose the small phone.

In one experiment, Liew and colleagues presented participants with three hypothetical “suspects” (like mobile phones in the examples above) in a criminal case, along with assessments of testimony strength of two eyewitnesses (like price and size dimensions in the mobile phone examples above; for more on eyewitness testimony see “Mistaking a murderer – Eyewitness memory blindness”). Participants had to weigh the “evidence” and choose one of the suspects as “guilty”. Configurations of eyewitness testimony strength were chosen to test for the existence of similarity, attraction, and compromise effects.

Instead of just averaging, however, Liew and colleagues used a hierarchical Bayesian clustering method to identify groups of participants with similar choice behaviour. The figure below shows what a difference this can make. The average proportion of participants selecting the three options in the test of the compromise effect is shown on the left. A compromise effect would exist if the “focal” option (the compromise stimulus) is chosen more than the “nonfocal” option (the second option) or the decoy (the extreme option). There doesn’t seem to be a compromise effect; each “suspect” appears to be selected as guilty with roughly equal probability.

*Figure 2 in the featured article (Liew et al.)*

The results of the cluster analysis of individual participants are shown on the right. Liew and colleagues identified six clusters of choice behaviour. In one cluster, cluster 3, a strong compromise effect is observed. Participants in cluster 2 favoured one of the two extreme options, while participants in clusters 1 and 4-6 favoured the other. Importantly, none of the clusters looks like the average. As in our coffee/tea example, the average choice behaviour tells us very little about any person’s actual decision behaviour.
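
To see how groups with strong preferences can average out to a flat profile, here is a small Python sketch. The three groups, their sizes, and their choice proportions are invented purely for illustration; they are not the six clusters or the data reported by Liew and colleagues, and the helper names are hypothetical.

```python
# Hypothetical illustration only: these groups, sizes, and proportions are
# invented for this sketch and are NOT the clusters reported by Liew et al.
OPTIONS = ("focal", "nonfocal", "decoy")

# Each hypothetical group: (number of participants, choice proportions per option).
groups = {
    "prefers_focal": (10, (0.8, 0.1, 0.1)),
    "prefers_nonfocal": (10, (0.1, 0.8, 0.1)),
    "prefers_decoy": (10, (0.1, 0.1, 0.8)),
}

def weighted_average(groups):
    """Average choice proportions across groups, weighted by group size."""
    total = sum(size for size, _ in groups.values())
    return tuple(
        sum(size * props[i] for size, props in groups.values()) / total
        for i in range(len(OPTIONS))
    )

def max_gap(profile, reference):
    """Largest absolute difference between two choice profiles."""
    return max(abs(p - r) for p, r in zip(profile, reference))

average = weighted_average(groups)
gaps = {name: max_gap(props, average) for name, (_, props) in groups.items()}

print("averaged profile:", dict(zip(OPTIONS, (round(a, 3) for a in average))))
for name, gap in gaps.items():
    print(f"{name}: largest gap from the average = {gap:.3f}")
```

In this made-up case the averaged profile is flat across the three options even though every group has a clear favourite, which is the same qualitative pattern as an average that resembles none of the clusters.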

Liew and colleagues emphasize the importance of accounting for individual differences in studies of choice behaviour. Models of decision making are evaluated on the basis of whether they exhibit the patterns seen in experiments. If models of decision making are designed to account for data patterns that no one actually displays, patterns that are artifacts of averaging, then these models will not represent true understanding of decision making behaviour. Averaged data patterns can be robust and replicable, but ultimately meaningless.

As a reviewer, I have seen problematic averaging practices in many papers and pointed out the averaging problem in review after review, only to have editors accept papers based on averaged data because it is “standard practice”. This practice sets back progress in the field, both methodologically and theoretically. I hope Liew and colleagues’ warning is heeded in the choice modeling literature, and beyond.