Prioritizing inclusivity does not sacrifice reliability

When I first dove into research at my current laboratory, I inherited a grab bag of R code snippets, shell scripts, tasks, paradigms, and measures. I’m sure I am not alone in this experience. Using existing tools and frameworks in research is common practice—it’s often more efficient and can yield better outcomes than building everything from scratch. However, what happens when these inherited resources fall short?

While many lab- or field-specific standards serve their intended purposes, using them without question may inadvertently limit what we can learn from them. Heesu (Ally) Kim et al. (2024) make this exact point in a recently published Behavior Research Methods paper.

“Our science can only be as good as our measures.”

They urge us all to reconsider what truly constitutes a “good” measure in cognitive science.

A digital illustration of a seesaw with the word “generalize” on the left side and the word “control” on the right side of the seesaw. In the center is a turquoise ball with the word “measure” written on it. The background is a pleasant light blue.
There is an assumed balance between how tightly a measure is controlled and how widely its results generalize. Perhaps there is less of a trade-off than originally thought? Illustrator: Hannah Mechtenberg.

In the cognitive sciences, there's a prevailing belief in an inherent trade-off between control and generalizability. Conventional wisdom suggests that loosening control over a myriad of variables weakens the consistency (i.e., reliability) and validity (i.e., confidence that a tool measures what it's intended to measure) of a tool or task, both critical aspects of its effectiveness. Unfortunately, tightly controlling the stimuli, experimental conditions, and participant demographics often limits our ability to generalize our findings. This assumed empirical teeter-totter forces researchers to balance experimental control and generalizability as they design and use cognitive tools.

However, it's plausible that this balancing act is less precarious than we think. Luckily, the notion that increased variation always leads to worse validity and reliability is a testable one, prompting an intriguing question: Can a tool that prioritizes inclusivity and representation, and therefore contains more variation in its materials, perform comparably to a less representative but widely used tool?

An array of photographs of the six authors of the paper that is the subject of the blog post. There are three photos in the top row and three photos in the bottom row. Everyone is looking at the camera and smiling.
Authors of the featured article "Multiracial Reading the Mind in the Eyes Test (MRMET): An inclusive version of an influential measure." Top row: Heesu (Ally) Kim (left), Jasmine Kaduthodil (center), Roger Strong (right). Bottom row: Laura Germine (left), Sarah Cohan (center), and Jeremy Wilmer (right).

Heesu (Ally) Kim, Jasmine Kaduthodil, Roger Strong, Sarah Cohan, Laura Germine, and Jeremy Wilmer (pictured above) present the newly minted Multiracial Reading the Mind in the Eyes Test (MRMET), an updated version of the widely cited Reading the Mind in the Eyes Test (RMET). The updated test prioritizes inclusivity across multiple dimensions, including race, gender, and vocabulary level. Recently published in Behavior Research Methods, comparisons between the new test (MRMET), the existing test (RMET), and other cognitive measures show that the Multiracial Reading the Mind in the Eyes Test meets, and in some respects exceeds, the standards set by its predecessor. These findings demonstrate that boosting the inclusivity of test materials does not compromise their reliability or validity.

About the new test, the authors wrote,

"[The MRMET] provides a touchstone for working toward, and demanding, greater inclusivity in scientific measures of human behavior more broadly."

The original Reading the Mind in the Eyes Test evaluates a person's ability to interpret facial expressions and engage in 'theory of mind,' or the ability to think about another person's thoughts, perspectives, emotions, and goals. This ability, to read people and their inner states, is important for effective social interactions, and thus the test is a mainstay in the fields of social cognition and autism research.

The test itself is deceptively simple. Participants are shown a rectangular cutout of a person’s face, centered on their eyes, on a screen. They are then asked to choose from a set of four options, each describing that person’s potential mental state. Options can range from ‘relaxed’ to ‘cautious about something over there.’

The original Reading the Mind in the Eyes Test has high levels of reliability and validity, and it distinguishes itself from other tests of reading facial expressions by avoiding common ceiling effects. However, it is not without its limitations. One notable drawback is its lack of diversity in facial representation: the stimuli feature exclusively White faces, and the female faces wear heavy makeup. There are also gender-stereotyped target states ('desire' for female faces and 'assertive' for male faces). Additionally, the word choices for the response options require a high vocabulary level, which may present a barrier for some participants. These limitations leave open questions about whether test outcomes generalize to faces of diverse races and to less gender-stereotyped physical and emotional characteristics, and whether the test is valid across participants with varying vocabulary levels.

The leftmost column of this figure includes example stimuli (grayscale photos of people's eyes) from the original RMET measure as well as potential response options for the emotional state of that person, including "embarrassed," "fantasizing," "confused," and "panicked." The rightmost column of this figure includes the stimuli and response options for the updated MRMET measure, which highlights the racial diversity of the people as well as the reduced vocabulary level of the response options, which are: "excited," "embarrassed," "interested," and "sarcastic."
Examples of the stimuli and response options for the original Reading the Mind in the Eyes Test (RMET; left) and the updated Multiracial Reading the Mind in the Eyes Test (MRMET; right).

In the updated Multiracial Reading the Mind in the Eyes Test, the four aforementioned aspects were altered to be more representative and inclusive. The authors iteratively honed the materials in the new test through systematic comparison with the original test, ensuring that the new items correlated highly with the items in the RMET, an indication of validity. Once a candidate set of items was selected, they distributed the test to a diverse sample of over 8,000 participants spanning ages 12 to 89 years. They evaluated its internal reliability (how well responses on one half of the test correlate with the other half), its correlation with vocabulary knowledge, and its correlation with social skills (assessed via the Autism Spectrum Communication questionnaire).
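
To make the internal reliability check concrete, here is a minimal split-half sketch in R (the language mentioned at the top of this post). The participant data, item count, and random split below are hypothetical, not the authors' analysis code; the Spearman-Brown step is a standard correction for estimating full-test reliability from two halves.

    # Hypothetical data: a participants-by-items matrix of 0/1 accuracy scores
    set.seed(1)
    n_items   <- 36
    responses <- matrix(rbinom(200 * n_items, 1, 0.7), nrow = 200)

    # Randomly split the items into two halves and score each half
    half    <- sample(rep(c(1, 2), length.out = n_items))
    score_a <- rowSums(responses[, half == 1])
    score_b <- rowSums(responses[, half == 2])

    # Spearman correlation between the halves, then the Spearman-Brown
    # correction to estimate reliability of the full-length test
    r_half <- cor(score_a, score_b, method = "spearman")
    r_full <- (2 * r_half) / (1 + r_half)
    r_full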

Across the board, the new test was shown to either match or surpass the original RMET:

  • The measures' split-half reliabilities (Spearman correlations between the two halves of each test) were statistically indistinguishable between the two tests.
  • The updated test correlates less strongly with vocabulary than the original test, indicating less reliance on vocabulary to perform well on the test (see the sketch after this list).
  • The more representative measure correlated just as strongly with the questionnaire on social skills as did the original test.
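
For readers curious how a comparison of correlation strengths might look in practice, the following R sketch uses Fisher's r-to-z transformation to compare two correlations from independent samples. The helper function and the numbers plugged into it are hypothetical placeholders rather than values or methods taken from the paper.

    # Hypothetical helper: compare two correlations (e.g., each test's
    # correlation with vocabulary) measured in independent samples
    compare_correlations <- function(r1, n1, r2, n2) {
      z1 <- atanh(r1)                          # Fisher r-to-z transform
      z2 <- atanh(r2)
      se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the difference
      z  <- (z1 - z2) / se
      p  <- 2 * pnorm(-abs(z))                 # two-tailed p-value
      list(z = z, p = p)
    }

    # Placeholder correlations and sample sizes, not figures from the paper
    compare_correlations(r1 = 0.45, n1 = 4000, r2 = 0.30, n2 = 4000)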

The MRMET is a much-needed update to a valuable measure and demonstrates that we can all work towards creating excellent tools, tasks, paradigms, and measures that are infused with equitable representation and inclusivity. Sometimes that means taking a good measure and devoting the time to make it even better. You can find the MRMET materials here: https://osf.io/ahq6n/

Innovation and iteration are cornerstones of science. Representation and inclusion are critical ingredients to ensure that cognitive science truly captures the full range of human variation. After all, one of the primary goals in cognitive science is to take what we learn from our often tightly controlled research environment and explain the behavior of real people out in the world. Perhaps we should all question the standards in our field more often and ask, “Is this the best we can do?”

Featured Psychonomic Society article

Kim, H., Kaduthodil, J., Strong, R. W., Germine, L., Cohan, S., & Wilmer, J. B. (2024). Multiracial Reading the Mind in the Eyes Test (MRMET): An inclusive version of an influential measure. Behavior Research Methods. https://doi.org/10.3758/s13428-023-02323-x

The Psychonomic Society (Society) is providing information in the Featured Content section of its website as a benefit and service in furtherance of the Society’s nonprofit and tax-exempt status. The Society does not exert editorial control over such materials, and any opinions expressed in the Featured Content articles are solely those of the individual authors and do not necessarily reflect the opinions or policies of the Society. The Society does not guarantee the accuracy of the content contained in the Featured Content portion of the website and specifically disclaims any and all liability for any claims or damages that result from reliance on such content by third parties.
