Out-thinking sub-optimal survey responders

It never ceases to amaze me the lengths people will go to “outsmart” a system – whether it is homework, a test, an insurance claim, a speeding ticket, a secure file, or a survey. Because humans engage in “sub-optimal” behavior (aka carelessness, insufficient effort, or deception), survey research is especially vulnerable and must constantly guard against this “misbehavior”.

“Tell Me Lies, Sweet Little Lies” (Fleetwood Mac)

Thinking about “sweet little lies” began as early as the late 1920s when Hartshorne and May published a book on human character with an emphasis on deceit in which they identified methods by which humans engaged in deceptive behavior. As instruments developed, the task to control for “intentional and unintentional inaccuracies” in responses became more intentional! For example, in an attempt to measure social adjustment for children, Washburne added questions to his objectivity scale to control for “sub-optimal” responses. Some of these questions included endorsing whether one had broken something before or was always on time to school or appointments, and response tendencies were used to evaluate the validity of the outcomes from the survey.

“X” Marks the Spot (Karmic)

Over time, these “lie” detection scales have been implemented in many types of self-report data. In today’s world, some of these questions take the form of a direct instruction like “Mark A for this question” or “Select the type of animal in this picture” (spider), while others are based on the “frequency” or “infrequency” of endorsing certain items. “Frequency” items are statements that most people would agree with (more than 90% would agree that they are younger than their parents), while “infrequency” items are statements that few people would agree with (less than 10% would agree that they can run 2 miles in 2 minutes).

“Choose Wisely” (Darren Waller)

This “misbehavior” issue has multiple consequences on the data. Not only does it create additional factors when understanding the data, but it can also increase or decrease observed effect sizes (the numerical indicator of the magnitude of the relationship or difference between the variables of interest). Thus, it is incumbent on researchers to be able to detect the presence of these careless or insufficient-effort responders.

There are multiple techniques for detecting these types of responses. Survey completion time is often evaluated: a very short duration is suspect, while a very long one is harder to interpret, as the respondent may simply have left the survey open after finishing it, or opened it without attempting or finishing it at all. Coupling duration with the percentage of the survey completed gives some insight into the potential usefulness of the data; for example, a survey with a very long duration and only 10% completed is most likely of little value. Another technique is to evaluate the variability in a survey taker’s responses, which can be cross-checked against the number of identical responses given in a row (the more items in a row with the same response, the more suspect the data). Today, cursor movement can also provide some indication of the quality of the data being gathered.
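For readers who like to see this kind of screening spelled out, here is a minimal Python sketch of the duration, completion, same-response-run, and variability checks described above. It is only an illustration: the function names and every cutoff (`min_seconds_per_item`, `max_run`, `min_sd`) are hypothetical placeholders, not values taken from the featured study.

```python
from itertools import groupby
from statistics import pstdev


def longest_string(responses):
    """Length of the longest run of identical consecutive responses."""
    return max((len(list(run)) for _, run in groupby(responses)), default=0)


def response_variability(responses):
    """Standard deviation of one respondent's answers (low values are suspect)."""
    return pstdev(responses) if responses else 0.0


def screen_respondent(responses, seconds, items_in_survey,
                      min_seconds_per_item=2.0, max_run=8, min_sd=0.5):
    """Return simple screening indicators for one respondent.

    `responses` holds the answers in presentation order, with None for
    skipped items. All thresholds are illustrative assumptions only.
    """
    answered = [r for r in responses if r is not None]
    return {
        "pct_complete": 100 * len(answered) / items_in_survey,
        "too_fast": seconds < min_seconds_per_item * items_in_survey,
        "long_run": longest_string(answered) > max_run,
        "low_variability": response_variability(answered) < min_sd,
    }


# A respondent who picked "3" for all 20 items in 25 seconds
straightliner = screen_respondent([3] * 20, seconds=25, items_in_survey=20)

# A respondent who took three minutes and varied their answers
careful = screen_respondent([1, 5, 2, 4, 3, 2, 4, 1, 5, 3] * 2,
                            seconds=180, items_in_survey=20)
```

In practice, no single indicator should decide a respondent's fate; the flags are more useful when read together, as the paragraph above suggests for duration and completion percentage.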

“Seeking the Truth” (Spriggan, Taisei Iwasaki)

In the study featured in this post, researcher (and creator of the scales) Cameron Kay (pictured below) examined a variety of validity outcomes for two scales designed to identify responses that could be characterized as careless or insufficient effort (because the truth of the matter is that sometimes people engage in inauthentic behavior). The two scales had not been validated previously, so this study was critical in assessing their effectiveness.

*Image of the author of the featured article, Cameron Kay.*

Both scales use a 5-point Likert scale; some of the items are reverse scored, and the item scores are then added together so that higher scores indicate careless or insufficient effort. The two scales vary in the presentation style of the items – statements versus adjectives. The Invalid Responding Inventory for Statements (IDRIS) has 14 statements that are divided into seven infrequency statements (e.g., “I am older than my parents.”) and seven frequency statements (e.g., “I can remember the names of most of my close family members.”). The Invalid Responding Inventory for Adjectives (IDRIA) has six adjectives, with three infrequency adjectives (e.g., “triangular”) and three frequency adjectives (e.g., “mortal”).
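To make the scoring logic concrete, here is a small Python sketch of a reverse-score-and-sum procedure. It rests on an assumption the post does not spell out: that the frequency items are the ones reverse scored, since agreeing with them is the attentive answer. The function name `score_scale` is hypothetical, and the toy example uses only the two IDRIS statements quoted above rather than the full 14-item scale.

```python
def score_scale(responses, frequency_items, scale_min=1, scale_max=5):
    """Sum item scores so that higher totals suggest careless responding.

    `responses` maps item text to an answer on the Likert scale
    (1 = strongly disagree, 5 = strongly agree). Items listed in
    `frequency_items` are reverse scored (an assumption of this sketch).
    """
    total = 0
    for item, answer in responses.items():
        if not scale_min <= answer <= scale_max:
            raise ValueError(f"answer out of range for item: {item!r}")
        if item in frequency_items:
            # Reverse score: 5 becomes 1, 4 becomes 2, and so on
            answer = scale_max + scale_min - answer
        total += answer
    return total


INFREQUENCY = "I am older than my parents."
FREQUENCY = "I can remember the names of most of my close family members."

# An attentive respondent disagrees with the first and agrees with the second
attentive = score_scale({INFREQUENCY: 1, FREQUENCY: 5}, {FREQUENCY})

# A respondent clicking the midpoint without reading
midpoint_clicker = score_scale({INFREQUENCY: 3, FREQUENCY: 3}, {FREQUENCY})
```

With two items, totals can range from 2 to 10; the attentive respondent lands at the floor, while the midpoint clicker drifts upward.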

In a series of six studies, Kay tested multiple forms of validity with replication. Both scales were validated against 11 previously validated indices, including (1) response duration (short is not good), (2) longest string of identical responses (long is questionable), (3) response variability (minimal is concerning), (4) the association between a respondent’s responses and the average respondent’s responses (low is problematic), (5 & 6) psychometric-synonyms and -antonyms indices (small correlations are red flags), (7) fake email addresses (why lie? But some were creative), (8) a self-reported item of whether or not the data are trustworthy (a yes or a no – no need to lie), and (9-11) several others.
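Index (4) may be the least intuitive of the bunch, so here is a rough Python sketch of one way it could be computed: correlate each respondent's answers with the item means of everyone else in the sample. The leave-one-out averaging and the function names are choices made for this illustration, not a description of how the featured study computed the index, and the data are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns None when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def person_total_correlations(data):
    """Correlate each respondent's row with the other respondents' item means.

    `data` is a list of rows (one per respondent, at least two rows),
    each a list of answers to the same items in the same order.
    """
    n = len(data)
    item_sums = [sum(column) for column in zip(*data)]
    results = []
    for row in data:
        # Leave the respondent out of the means they are compared against
        others_mean = [(s - r) / (n - 1) for s, r in zip(item_sums, row)]
        results.append(pearson(row, others_mean))
    return results


# Made-up data: the last respondent answers against the grain of the sample
sample = [
    [5, 4, 1, 2, 5],
    [4, 5, 2, 1, 4],
    [5, 5, 1, 1, 5],
    [1, 2, 5, 4, 1],
]
correlations = person_total_correlations(sample)
```

A low or negative correlation means the respondent's pattern of answers looks little like everyone else's, which is why "low is problematic" for this index.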

“The Final Countdown” (Europe)

The findings of all six studies verified that the scales could effectively detect folks who were careless or putting less effort into their responses. As the results from Study 2, with roughly 700 undergraduate students, show in the series of scatterplots presented below, the predicted relationships between each scale (IDRIS & IDRIA) and six typically tested validation measures were supported. The six measures presented consecutively in the figure below correspond to the numbers in the description of the measures above (i.e., 1 = Duration). For example, duration was expected to be shorter for careless and insufficient-effort responders, which is what is observed in the first plots of the figure below for both the IDRIS (A column) and the IDRIA (B column).

*Figure (12 scatterplots): relationships between each scale (IDRIS & IDRIA) and six of the previously validated measures from the second study with 701 undergraduate students. More positive values on the scale score axis (IDRIS/IDRIA) represent careless or insufficient-effort responders. Scores on the y-axis are presented on the individual scale of each validation measure tested (1-6 as described above).*

These results were replicated in each study using different types of samples – undergraduate students from different universities; members of the general population in the US, India, and Nigeria recruited via Qualtrics Panels, a vetted online survey company; and other members of the general population who are part of the MTurk worker database. Moreover, the additional validation measures, namely fake email addresses and a self-rating of the usefulness of the data submitted, also supported the detection of the “misbehaving” respondents. I find it quite ironic that these respondents are willing to be truthful on a question that ultimately calls them out after spending so “much” time being inauthentic! Then again, I suppose if we couldn’t depend on people being careless or unwilling to provide genuine responses, we wouldn’t have so much fun designing items to catch them.

“To Be Free” (Dylan Gossett and VALID!)

Both scales are part of an online database of over 660 items that is open to any interested party – the Comprehensive Infrequency/Frequency Item Repository (https://cifr-project.org/portal.html). The scales are also nonproprietary, which means the items can be modified as researchers wish. The items are designed to avoid drawing attention to themselves, so as not to tip off the respondent that they are attention checks. Finally, these scales minimize a number of other issues found in previous scales, including an imbalance in the number of frequency and infrequency items and conspicuous linguistic features such as proper nouns, uncommon words, numbers, or unusual punctuation. Ultimately, the results of this study demonstrated strong validity characteristics for both scales, and researchers should take advantage of these tools that can help control the influence of respondents’ “misbehavior”.

Cue “Gonna Fly Now” from Rocky by Bill Conti.

Featured Psychonomic Society article

Kay, C. S. (2024). Validating the IDRIS and IDRIA: Two infrequency/frequency scales for detecting careless and insufficient effort survey responders. Behavior Research Methods, 1-24. https://doi.org/10.3758/s13428-024-02452-x

Author

Heather Manitzas Hill

Heather Hill is a Professor at St. Mary’s University. She has conducted research on the mother-calf relationship and social development of bottlenose dolphins in human care. She also studied mirror self-recognition and mirror use in dolphins and sea lions. Most recently, she has been studying the social behavior and cognitive abilities of belugas, killer whales, Pacific white-sided dolphins, and bottlenose dolphins in human care. She has also been known to dabble in various aspects of human cognition and development, often at the intersection of those two fields.