Imagine an experiment in the psychological laboratory. In the experiment, some number of participants are asked to solve problems of varying difficulty. Crucially, the participants are unique individuals and not faithful copies of one another.
If that bit of fiction sounds familiar, that might be because it describes a good chunk of cognitive science and psychological research. Much as we would like to study a single Platonic human to describe their inner workings with mathematical precision, we must confront the reality that our experimental participants are flawed projections on the cave wall. They differ from one another in almost every respect, from being more attentive, biased, or cautious to more xenophobic, younger, and more—uh—zesty, maybe?
In the face of stubborn diversity, empirical researchers have widely agreed on a two-pronged strategy in which first a representative sample from a population is drawn, and then results from such a sample are averaged over participants in the hopes that interindividual differences will largely cancel one another out.
It turns out, however, that “averaging over participants” can mean many things beyond simply taking the arithmetic mean of participant-level data, and there are many cases where such naive averaging is clearly the wrong strategy. Consider the figure below. In the top panel, we see error curves for five participants of different ability. For each participant, the error rate increases with problem difficulty at an identical rate. The middle panel shows the result of naive averaging: the error rate across the sample of participants at each level of difficulty, so that the five curves are, by definition, averaged vertically. This average curve looks nothing like the individual curves! If we were to take the middle panel as a summary of the sample, we would conclude that the relationship between problem difficulty and error rate is much shallower than it is for any individual. Because of this aggregation error, the average fails to preserve a critical property that every individual curve shares.
Now consider the bottom panel, in which the curves have been aggregated horizontally rather than vertically. For each curve, I found the difficulty that corresponds to a 10% error rate and I averaged those difficulties. Then I did the same for the 30%, 50%, 70%, and 90% error rates. Those averaged difficulties are given by the empty circles in the bottom panel. Then I did the same for a few dozen other error rates to fill out the curve. This horizontal aggregate looks a lot more representative of the five original curves (in fact, it is exactly the curve of the participant in the middle). It reproduces exactly the relationship between difficulty and error within a participant. Horizontal aggregation seems like a better choice here.
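To make the contrast concrete, here is a minimal Python sketch of both aggregates. The participants, thresholds, and logistic error curves are made up for illustration and are not the data behind the figure; the point is only that vertical averaging flattens the curve while horizontal averaging does not.

```python
import numpy as np

# Hypothetical participants: identical slope, different thresholds,
# so every individual error curve is a shifted logistic.
slope = 2.0
thresholds = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def error_rate(difficulty, threshold):
    """Logistic error curve: error rises with difficulty."""
    return 1.0 / (1.0 + np.exp(-slope * (difficulty - threshold)))

difficulty = np.linspace(0, 8, 200)

# Vertical (naive) aggregation: average the error rates at each difficulty.
curves = np.array([error_rate(difficulty, th) for th in thresholds])
vertical_average = curves.mean(axis=0)

# Horizontal aggregation: for each target error rate, find the difficulty
# each participant needs to reach it, then average those difficulties.
target_errors = np.linspace(0.05, 0.95, 19)

def difficulty_at(error, threshold):
    """Invert the logistic: difficulty that yields a given error rate."""
    return threshold + np.log(error / (1.0 - error)) / slope

horizontal_average = np.array(
    [np.mean([difficulty_at(e, th) for th in thresholds]) for e in target_errors]
)

# The horizontally aggregated curve has the same steepness as every
# individual curve, while the vertical average is much shallower.
```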
Error curves and learning curves are easy examples, but cognitive scientists often deal with a much more complicated type of curve: the probability distribution. The most typical example is the response time (RT) distribution. There, too, naive aggregation can lead to unintelligible summaries and artifacts, and this can be avoided to some extent with horizontal aggregation methods. Formally described by cognitive scientist Roger Ratcliff of diffusion model notoriety, this approach is now often called “vincentization” or vincentizing, and the aggregate quantiles are called vincentiles. The name, by the way, honors the biologist S. B. Vincent, who did something similar in 1912.
Vincentization works the same way as the horizontal aggregation in the previous example. Consider the figure below, in which the top part shows the RT distributions of 11 participants. Clearly, these 11 are not unanimous. Within each distribution, the 10th percentile is marked with a dark blue triangle, and the 30th, 50th, 70th, and 90th percentiles in other colors. Below the curves, I have collected the individual percentiles (upward pointing triangles). Those percentiles are then averaged into so-called “vincentiles” (diamonds).
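As a rough illustration of that recipe, the sketch below computes each participant’s percentiles and then averages them across participants to obtain vincentiles. The lognormal RTs and the particular percentile levels are my own stand-ins, not the data in the figure.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: RTs (in ms) for 11 participants, each shifted a bit,
# mimicking individual differences in speed.
rts_per_participant = [
    rng.lognormal(mean=np.log(300 + 20 * i), sigma=0.3, size=200)
    for i in range(11)
]

def vincentize(rt_lists, probs):
    """Average each percentile across participants to get vincentiles."""
    # One row of percentiles per participant, then average down the columns.
    participant_percentiles = np.array(
        [np.percentile(rts, probs) for rts in rt_lists]
    )
    return participant_percentiles.mean(axis=0)

probs = [10, 30, 50, 70, 90]
vincentiles = vincentize(rts_per_participant, probs)
print(dict(zip(probs, np.round(vincentiles))))
```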
Vincentization has been popular in RT research because the histograms implied by the vincentiles look more like a representative member of the sample of histograms that the 11 individuals provided. The naive vertical average (squares) shows a very different distribution.
However, the apparent advantage of vincentization, namely that it preserves interesting properties of the individual-level distributions, may be largely illusory. For example, cognitive model parameters estimated from vincentiles are not better estimates than those coming from more mundane methods. Indeed, the effectiveness of the method is limited to some relatively simple cases.
A more modern method for dealing with individual differences is hierarchical Bayesian modeling of response times or choice response times. Hierarchical Bayesian cognitive models apply the psychometric logic of random effects to model parameters (allowing such models to explain variability among participants, items, groups, etc.), and thereby effectively solve the aggregation problem for empirically observed curves.
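As a hedged sketch of what that random-effects logic looks like in practice, here is a toy hierarchical model written with the PyMC library. The simulated log RTs and the single participant-level parameter (a mean log RT) are my own simplifications; a real application would place the hierarchy over the parameters of a full cognitive model, such as the diffusion model, rather than this toy likelihood.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)

# Simulated data: 11 participants, 100 trials each, log RTs (log ms)
# with participant-specific means scattered around a group mean.
n_participants = 11
participant = np.repeat(np.arange(n_participants), 100)
true_means = rng.normal(5.7, 0.2, size=n_participants)
log_rt = rng.normal(true_means[participant], 0.3)

with pm.Model() as hierarchical_rt:
    # Group-level ("population") parameters.
    mu_group = pm.Normal("mu_group", mu=5.7, sigma=1.0)
    sigma_group = pm.HalfNormal("sigma_group", sigma=1.0)

    # Participant-level means drawn from the group distribution:
    # this is the random-effects step that absorbs individual differences.
    mu_participant = pm.Normal("mu_participant", mu=mu_group,
                               sigma=sigma_group, shape=n_participants)

    # Trial-level likelihood on the observed log RTs.
    sigma_trial = pm.HalfNormal("sigma_trial", sigma=1.0)
    pm.Normal("obs", mu=mu_participant[participant],
              sigma=sigma_trial, observed=log_rt)

    trace = pm.sample(1000, tune=1000)
```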
Another potential application of vincentization is in the analysis of time courses. If an experiment yields a RT on each trial and some associated dependent variable (e.g., accuracy) on that trial, we could graph the evolution of accuracy over RT.
To aggregate over different participants’ time courses, we could collect their responses in temporal bins and vincentize the distribution within each bin over participants. Unfortunately, such a procedure risks destroying any temporal patterns that occur out of phase because the aggregation would occur over qualitatively unlike timepoints—imagine two waveforms in counterphase being averaged; the aggregate would retain none of the waveform.
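A tiny numerical example makes the counterphase worry concrete: averaging two identical oscillations that are half a cycle out of phase leaves nothing of the oscillation.

```python
import numpy as np

t = np.linspace(0, 1, 500)
wave_a = np.sin(2 * np.pi * 5 * t)           # a 5 Hz oscillation
wave_b = np.sin(2 * np.pi * 5 * t + np.pi)   # the same oscillation, half a cycle out of phase

average = (wave_a + wave_b) / 2
print(np.abs(average).max())  # essentially 0: the oscillation cancels out completely
```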
In a recent article in the Psychonomic Society’s journal Attention, Perception, & Psychophysics, Jonathan van Leeuwen, Jeroen Smeets, and Artem Belopolsky introduce a novel method dubbed the “smoothing method for analysis of response time-course,” or SMART. In the SMART method, a signal is temporally smoothed using kernel density estimation; then a “vertical” weighted aggregate is formed. With its associated inferential statistics, the method can be used to test chronometric functions for changes and deflections, or for differences between conditions. The authors provide computer code as well as examples of the SMART method in which they revisit the time course of behavior in saccade tasks.
The key step in this method is the very first one. The kernel smoothing operation can turn a dependent variable (e.g., accuracy) sampled at irregularly spaced time course points into a smooth, continuous time course function that can be evaluated at any point. That is, if the observed RTs are 200 ms, 300 ms, and 350 ms, and the associated responses are correct, correct, and incorrect, the smoothing allows one to interpolate the expected accuracy at 250 ms and any other time point. Once that function is available, it becomes possible to average this curve across participants at any time point. The figure below shows both steps. Given some desired time course point t, we interpolate the dependent variable at that time point for each participant. The reliability of this interpolation depends on how many observations a participant had near t, and the averaging takes this into account.
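The sketch below is not the authors’ published implementation; it is a minimal illustration of the general idea under my own assumptions (a Gaussian kernel, made-up trial data): smooth each participant’s trial-level accuracies over RT, keep the summed kernel mass at each time point as a measure of how much data supports the estimate there, and use those masses as weights in the across-participant average.

```python
import numpy as np

def kernel_smooth(rts, accuracies, eval_times, bandwidth=50.0):
    """Gaussian-kernel estimate of accuracy as a function of RT.

    Returns the smoothed accuracy at each evaluation time and the
    summed kernel weight there (how much data supports that estimate).
    """
    dist = eval_times[:, None] - rts[None, :]          # (n_times, n_trials)
    kernel = np.exp(-0.5 * (dist / bandwidth) ** 2)
    weight = kernel.sum(axis=1)
    smoothed = kernel @ accuracies / weight
    return smoothed, weight

# Hypothetical data for a handful of participants: per-trial RTs (ms)
# and correct/incorrect responses coded as 1/0.
rng = np.random.default_rng(11)
eval_times = np.linspace(150, 600, 91)

smoothed_curves, weights = [], []
for _ in range(5):
    rts = rng.gamma(shape=6, scale=50, size=300)                 # skewed RTs
    acc = (rng.random(300) < 0.6 + 0.0006 * rts).astype(float)   # accuracy drifts with RT
    s, w = kernel_smooth(rts, acc, eval_times)
    smoothed_curves.append(s)
    weights.append(w)

smoothed_curves = np.array(smoothed_curves)
weights = np.array(weights)

# Weighted "vertical" average: participants with more data near a time
# point contribute more to the group curve at that point.
group_curve = (weights * smoothed_curves).sum(axis=0) / weights.sum(axis=0)
```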
As the authors point out, however, the SMART method has some remaining weaknesses. For example, the smoothing process requires some fine-tuning: with too wide a smoothing kernel, high-frequency patterns may be erased. At the other end of the spectrum, low-frequency qualitative patterns that exist within participants may be averaged out if participants differ in the temporal scale of the time course.
Ultimately, it is important that the analytical decisions that go into data aggregation be made carefully and judiciously. As in all complex analyses, it may be worth considering sensitivity analyses to evaluate the robustness of one’s conclusions to these sorts of analytical choices.
Psychonomics article featured in this post:
van Leeuwen, J., Smeets, J. B., & Belopolsky, A. V. (2019). Forget binning and get SMART: Getting more out of the time-course of response data. Attention, Perception, & Psychophysics. DOI: 10.3758/s13414-019-01788-3.