Is a picture necessarily worth a thousand words? Do bilinguals always find some grammatical features in their second language to be more difficult than native speakers of that language do? Is the Stroop effect necessarily larger when the task is to name the ink color of a color word than when the task is to read that word?
Pretty much any experiment conducted by researchers in cognitive science will involve a comparison between at least two groups or two conditions to which participants are exposed. That comparison will necessarily involve some form of statistical test, be it a frequentist or Bayesian test. There is almost no escaping the use of a test, because even confidence intervals really are variants of a statistical test, albeit with properties that many researchers do not necessarily understand.
So let us consider statistical tests.
Even the most basic of tests, for example the t-test that was invented to monitor the quality of Guinness stout (with great success, in my view), rests on various assumptions. For example, the data must be sampled independently from two normal distributions (in the two-sample case we are concerned with here) with, ideally, equal variances.
What happens if we violate those assumptions? Can we avoid those assumptions altogether?
The answer to the first question is nuanced, but in some cases it is “not much”. Almost 60 years ago, Alan Boneau published what I believe to be the first Monte Carlo experiment on the properties of the t-test. Let us look at his methodology in some detail because it will also help us understand the answer to the second question.
A Monte Carlo experiment simulates a process or procedure by sampling random numbers (hence the name). Because we can control the exact nature of those random numbers, and because we know exactly how they were sampled, we can use Monte Carlo techniques to gather insight into the behavior of statistical tests. In a nutshell, we create a situation in which we know with 100% certainty that something is true. For example, we may know that the null hypothesis is true because we sample some random numbers from two populations with identical means and variances.
Suppose we do precisely that (with 15 observations per group, let’s say) and then conduct a t-test. What’s the expected probability of us finding a significant difference between our two samples? Exactly, it’s .05 (assuming we set our alpha level to .05).
Now suppose we repeat that process 1,000 times and count the actual number of times the t-test is significant. We would expect to count around 50 such events, each of which represents a dreaded Type I error, give or take a few because the process is random.
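To make this procedure concrete, here is a minimal sketch in Python. It is my own illustration and not Boneau’s code: the function names, the seed, and the use of standard normal populations are assumptions, and the critical value for 28 degrees of freedom is hard-coded from standard t tables so that nothing beyond the standard library is needed.

```python
import random
import statistics

# Two-sided critical value of Student's t for alpha = .05 and
# df = 15 + 15 - 2 = 28, taken from standard t tables (assumption:
# hard-coded here to avoid a third-party statistics library).
T_CRIT_DF28 = 2.048


def pooled_t(x, y):
    """Student's two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def type_i_rate(n_sims=1000, n=15, seed=1):
    """Proportion of significant t-tests when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both samples come from the same normal population, so every
        # significant result is a Type I error by construction.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pooled_t(a, b)) > T_CRIT_DF28:
            hits += 1
    return hits / n_sims


print(type_i_rate())
```

With 1,000 replications the printed proportion should land near .05, though it will wobble from seed to seed because the process is random.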
Enter the important question: what happens if we violate the assumptions and we sample from, say, two uniform distributions instead of the normal distributions as required by the t-test? What if we introduce inequality between the variances?
Boneau explored several of those potential violations. Here is what he had to say:
“it is demonstrated that the probability values for both t, and by generalization for F are scarcely influenced when the data do not meet the required assumptions. One exception to this conclusion is the situation where there exists unequal variances and unequal sample sizes. In this case the probability values will be quite different from the nominal values.”
That’s fairly good news because we do not have to lie awake at night wondering whether our data really are normally distributed. But there is that rather large fly in the ointment, namely that our presumed Type I error level is not what we think it is when we have unequal sample sizes (often unavoidable) and the variances between our two groups are different (also often unavoidable).
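The problematic case can be sketched with the same Monte Carlo logic. The code below is an illustration under my own assumptions, not a reproduction of Boneau’s conditions: the sample sizes (5 and 15), the standard deviations (1 and 2), and the function names are all chosen for the example, and the critical value for 18 degrees of freedom is hard-coded from standard t tables.

```python
import random
import statistics

# Two-sided critical value of Student's t for alpha = .05 and
# df = 5 + 15 - 2 = 18, taken from standard t tables (assumption).
T_CRIT_DF18 = 2.101


def pooled_t(x, y):
    """Student's two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def rejection_rate(n1, sd1, n2, sd2, t_crit, n_sims=2000, seed=1):
    """Proportion of significant tests when both population means are 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # The means are identical, so the null hypothesis about the means
        # is true; only sample size and variance differ between groups.
        a = [rng.gauss(0, sd1) for _ in range(n1)]
        b = [rng.gauss(0, sd2) for _ in range(n2)]
        if abs(pooled_t(a, b)) > t_crit:
            hits += 1
    return hits / n_sims


print("small group more variable:", rejection_rate(5, 2, 15, 1, T_CRIT_DF18))
print("large group more variable:", rejection_rate(5, 1, 15, 2, T_CRIT_DF18))
```

When the smaller group has the larger variance, the rejection rate should come out well above the nominal .05; when the larger group has the larger variance, it should come out well below. Either way, the Type I error level is not what we think it is.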
This brings us to the second question: can we avoid those assumptions altogether? Could we perform comparisons between conditions without having to worry about, well, anything really?
A recent article in the Psychonomic Society’s journal Behavior Research Methods addressed this question and introduced a new method for statistical comparisons that does not make any assumptions about how the data are to be modeled. Researchers Bommae Kim and Timo von Oertzen based their technique on the Support Vector Machine (SVM), an algorithm developed by artificial-intelligence researchers.
In a nutshell, an SVM learns from examples how to assign labels to objects. The range of applications of SVMs is incredibly broad: SVMs can learn to detect fraudulent credit card transactions by examining thousands of credit card activities for which it has already been established whether they are fraudulent or nonfraudulent. SVMs can learn to recognize handwriting by learning from a large collection of images of handwritten digits or letters. And the list goes on.
How does an SVM do this?
The figure below, taken from an excellent primer on SVMs, shows the simplest possible example. The data are taken from genetics in this instance, but the same principle applies to any other data set consisting of two groups (in this case the green vs. red dots that are separated along two dimensions of measurement).
The panel on the left shows the data, including one observation whose group membership is unknown (the blue dot). The panel on the right shows the “hyperplane” (in this case a line) that the SVM learns to arrange in a way that optimally differentiates between the two clusters. The unknown observation is now clearly identified as belonging to the red cluster.
Unlike a t-test, the SVM does not make any assumptions about the nature of the data: it simply seeks to differentiate between two clusters as best it can. If the SVM can assign group membership more accurately than expected by chance, then it has successfully learned the difference between the two groups. Crucially, it can only do so if there is a discernible difference between the two groups. In the above figure, if the red and green dots were randomly intermixed, this difference could not be learned and classification of unknown test cases (i.e., the blue dot) would be at chance. (In reality, an SVM does a lot more than drop a line in between two clusters; this tutorial provides a good introduction.)
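For readers who like to see the idea in code, here is a rough sketch of how a separating line can be learned. It is a bare-bones linear soft-margin SVM trained by subgradient descent on the hinge loss, which is a simplification I chose for illustration and not the implementation used in the article or in any SVM library. The cluster locations, the learning rate, the regularization strength, and the function names are all assumptions.

```python
import random


def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=5000):
    """Fit w and b by batch subgradient descent on the regularized hinge loss.

    labels must be +1 or -1; points are (x1, x2) pairs.
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        # Gradient of the regularization term (lam / 2) * ||w||^2.
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for (x1, x2), y in zip(points, labels):
            # Only points inside the margin (or misclassified) contribute.
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                grad_w[0] -= y * x1 / n
                grad_w[1] -= y * x2 / n
                grad_b -= y / n
        w[0] -= lr * grad_w[0]
        w[1] -= lr * grad_w[1]
        b -= lr * grad_b
    return w, b


def predict(w, b, point):
    """Return +1 or -1 depending on the side of the line the point falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1


rng = random.Random(0)
# Hypothetical data: a "green" cluster (label -1) and a "red" cluster (label +1).
green = [(rng.gauss(1, 0.5), rng.gauss(1, 0.5)) for _ in range(20)]
red = [(rng.gauss(4, 0.5), rng.gauss(4, 0.5)) for _ in range(20)]
points = green + red
labels = [-1] * 20 + [1] * 20

w, b = train_linear_svm(points, labels)
unknown = (3.5, 4.0)  # plays the role of the blue dot
print("unknown point assigned to:", "red" if predict(w, b, unknown) == 1 else "green")
```

The learned line is described by the weights w and the offset b; a new observation is labeled according to the side of the line on which it falls.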
So here, then, is the SVM equivalent of a t-test: two groups differ on one (or more) measure(s) if the machine can learn to assign unknown cases with above-chance accuracy. The unknown cases are simply those that the SVM is not trained on: we leave out some subset of the observations during training and then seek to predict the group membership of those “unknown” items after training. To maximize power, the SVM can be applied repeatedly, with each observation taking a turn as the single “unknown” observation that is left out.
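The leave-one-out logic can be sketched as follows. To keep the example free of third-party libraries I have swapped in a simple nearest-neighbor classifier for the SVM, so this illustrates the structure of the test and not the authors’ method. Comparing the number of correct predictions against chance with a one-sided binomial test is likewise my assumption; it treats the held-out predictions as independent, which is only an approximation, and it takes chance to be .5, which presumes equal group sizes.

```python
import math
import random


def knn_predict(train, x, k=3):
    """Majority label (0 or 1) among the k training values closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if votes_for_1 * 2 > k else 0


def loo_correct(data, k=3):
    """Leave each observation out in turn and count correct predictions."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except observation i
        if knn_predict(train, x, k) == label:
            correct += 1
    return correct


def binomial_p_at_least(successes, n, p=0.5):
    """One-sided probability of observing at least `successes` hits by chance."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(successes, n + 1))


def classifier_test(group_a, group_b, k=3):
    """Return leave-one-out accuracy and its one-sided binomial p-value."""
    data = [(x, 0) for x in group_a] + [(x, 1) for x in group_b]
    correct = loo_correct(data, k)
    return correct / len(data), binomial_p_at_least(correct, len(data))


rng = random.Random(0)
# Hypothetical groups with the same mean but different variances.
group_a = [rng.gauss(0, 1) for _ in range(40)]
group_b = [rng.gauss(0, 4) for _ in range(40)]
accuracy, p_value = classifier_test(group_a, group_b)
print("accuracy:", accuracy, "p:", p_value)
```

The groups in this example differ only in variance, which is exactly the kind of difference a comparison of means is not designed to detect.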
Kim and von Oertzen reported multiple Monte Carlo experiments to demonstrate the utility of the SVM as a statistical analysis tool.
The simplest experiment is sketched in the figure below. Each panel contains two distributions, assumed to represent two different groups in an experiment. The two distributions differ either in terms of means only (panel a), or only variances (b), or shape (c), or all of the above (d).
The next figure shows the results of this experiment. All cell entries refer to the proportion of times that the tests yielded a significant difference between the two groups. The top row (Condition 1) refers to the situation in which the null hypothesis was perfectly true and the groups differed in neither mean (M), nor shape, nor variance (SD). The entries for that row therefore reflect Type I errors, and it can be seen that both the t-test and the SVM were close to the expected .05 level.
Now consider the remaining rows of the table. Although the t-test was more powerful than the SVM when groups differed only in mean (.31 vs. .13) or in mean and variance (.27 vs. .15), the SVM outperformed the t-test in all other situations, in particular those involving a difference in shape between conditions.
Across a number of further conditions, including an experiment involving multivariate measures, Kim and von Oertzen observed that
“the SVMs showed the most consistent performance across conditions. Moreover, SVMs’ power improved when group differences came from multiple sources or when data contained multiple variables.”
The SVM is therefore particularly useful if the research question is to find any kind of differences between groups, whereas conventional methods are more useful if the focus is on specific differences between groups. Kim and von Oertzen argue that the search for any differences can be crucial in clinical applications or program evaluations, for example to ascertain that control and treatment groups do not differ on multiple measures after randomization.
Psychonomics article highlighted in this blogpost:
Kim, B., & von Oertzen, T. (2017). Classifiers as a model-free group comparison test. Behavior Research Methods. DOI: 10.3758/s13428-017-0880-z.