Automatic detection of automatic response generators: How to improve data quality in online research

In recent years, researchers have started using Amazon Mechanical Turk and similar services to collect data from online participants. Two big benefits are the speed and ease of data collection. A study that might take a year to run using a participant pool at a small university could now be completed in a day, and with no reliance on research assistant time or lab space. We have addressed issues with online testing, in particular using MTurk, several times on this blog, for example here, here, and here.

There are also other benefits to collecting data online. Traditional psychology participant samples are WEIRD (western, educated, industrialized, rich, democratic), so testing online allows researchers to get samples that may be more representative of the worldwide population. A recent post on this site addressed the problems with WEIRD participants, and the need to de-WEIRD our samples.

But there are also drawbacks. Data collected online tend to be noisier, since people vary in the state and environment they’re in as they complete the task. They might be distracted and do the task in the background while they work or hang out at home. They might click through the task randomly, trying to finish as quickly as possible. Or they might even use automated form fillers (e.g., the Google Chrome Form Filler plug-in) to randomly select responses without having to click on each one.

So, what’s a researcher to do?

Screening for low-quality responses

There are a variety of ways to identify and screen out low-quality survey and task responses. These include removing data from people who completed a task too quickly, using catch trials (items like “select the second option for this question” or reading-comprehension questions), or identifying people who frequently changed their responses or clicked the mouse more than would be expected.
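
To make the catch-trial idea concrete, here is a minimal sketch in Python. This is not code from the article; the column name, the instructed option, and the example data are hypothetical.

```python
# Hypothetical sketch of a catch-trial screen; the column name and the
# instructed option are made up for illustration.
import pandas as pd

def failed_catch_trial(responses: pd.DataFrame,
                       catch_column: str = "catch_item",
                       instructed_option: int = 2) -> pd.Series:
    """Return True for rows that did NOT pick the instructed option."""
    return responses[catch_column] != instructed_option

# Example: flag respondents who missed "select the second option".
data = pd.DataFrame({"catch_item": [2, 5, 2, 1]})
print(failed_catch_trial(data))  # False, True, False, True
```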

However, no one method of data screening may work reliably across studies, and different methods will screen out different numbers of respondents.

To help fellow researchers be more systematic in their data screening, Erin Buchanan and John Scofield developed a tool that combines multiple data-screening methods. Their tool is described in a recent article in the Psychonomic Society’s journal Behavior Research Methods and is available to use in R and even online through a web browser.

So, how did they create it?

It takes a form filler to know a form filler

First, Buchanan and Scofield developed ways to automatically detect data produced by automated form fillers. They used the popular Form Filler plug-in to fill in a 100-question, seven-option survey over and over and over.

Here are the tell-tale signs they noticed. First, the click counts for survey completion were zero. The form filler didn’t “click” each answer as it filled it in. Second, the survey responses were really fast. A survey with a hundred questions was submitted within a few seconds. Finally, the survey response distribution was uniform – each of the seven question options was equally likely to be chosen.
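
Put together, these three signatures are straightforward to check programmatically. The sketch below is an illustration of the idea, not the authors’ tool; the threshold, variable names, and the chi-square test for uniformity are assumptions.

```python
# Illustrative check for the three form-filler signatures described above:
# no clicks, implausibly fast submission, and a near-uniform spread of answers.
# The 10-second threshold and the uniformity test are assumptions, not values
# taken from the article.
from collections import Counter
from scipy.stats import chisquare

def looks_automated(click_count, seconds_to_submit, answers,
                    n_options=7, max_seconds=10):
    no_clicks = click_count == 0
    too_fast = seconds_to_submit < max_seconds
    # Chi-square goodness-of-fit against a uniform distribution over the options.
    counts = [Counter(answers).get(opt, 0) for opt in range(1, n_options + 1)]
    _, p_value = chisquare(counts)
    uniform = p_value > 0.05  # cannot reject uniformity
    return no_clicks and too_fast and uniform
```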

Identifying automated, low effort, and high effort responses

Armed with this knowledge, Buchanan and Scofield conducted a study to distinguish automated, low effort, and high effort human responses.

Participants were asked to fill in a survey in three ways: high effort (read and answer each question), low effort (select random answers at their own pace), and using an automated form filler. The survey included an attention check, and the researchers also measured click counts, page timing, the use of answer choices, and the distribution of answer choices as possible indicators of low data quality.

Only 3% of high effort responses failed the attention check, but most of the low effort and form filler responses did. Answer choices were also useful in identifying high effort responses: the range of answers used in high effort responses tended to be smaller.

Click counts were useful in distinguishing form filler responses from low and high effort responses: on a 15-item survey, the average click count for form filler responses was only 1.18.

Page submit times varied between the types of responses, too. Most low effort and automatic responses were faster than would be expected considering reading time for the survey, while most high effort responses were not.
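
One way to operationalize “faster than would be expected” is to compare the page-submit time against a minimum plausible reading time for the page. The sketch below is an illustration only; the words-per-minute figure is a placeholder assumption, not the value used in the study.

```python
# Rough sketch: flag a page as suspiciously fast if it was submitted in less
# time than a plausible reading speed would allow. 300 words per minute is an
# assumed placeholder, not a figure from the article.
def faster_than_reading(page_word_count: int,
                        seconds_to_submit: float,
                        words_per_minute: float = 300) -> bool:
    min_reading_seconds = page_word_count / (words_per_minute / 60)
    return seconds_to_submit < min_reading_seconds

print(faster_than_reading(page_word_count=250, seconds_to_submit=20))  # True
```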

Using a cut-off of two or more failed indicators (low click count, short page-submit times, an unexpectedly wide range of answer choices, a uniform answer distribution, and a failed attention check), Buchanan and Scofield could identify 100% of the automated responses and 99% of the low effort responses, while excluding only 2% of high effort responses.
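
In code, the two-or-more rule amounts to counting failed indicators per respondent. The sketch below assumes the five flags have already been computed (for example, with helpers like the ones above); it is an illustration, not the authors’ implementation.

```python
# Sketch of the two-or-more-failed-indicators rule. Each argument is a boolean
# flag assumed to have been computed already for one respondent.
def is_low_quality(low_click_count: bool,
                   short_page_times: bool,
                   wide_answer_range: bool,
                   uniform_answer_distribution: bool,
                   failed_attention_check: bool,
                   cutoff: int = 2) -> bool:
    flags = [low_click_count, short_page_times, wide_answer_range,
             uniform_answer_distribution, failed_attention_check]
    return sum(flags) >= cutoff

# A respondent who was only fast, but fine on everything else, is kept.
print(is_low_quality(False, True, False, False, False))  # False
```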

Moving to the field

The researchers then administered the same survey to over 1,000 Amazon Mechanical Turk participants to see how they would fare on the five indicators of low data quality.

Here is what they found:

  • Around 2% of responses had low click counts
  • Over half of the responses were faster than would be expected based on reading time
  • Around 20% of responses used more answer options than would be expected
  • Around 2% of responses had a uniform distribution of answers
  • Around 5% of responses failed the attention check

In total, 14% of the responses were identified as low quality using the cut-off of two or more failed indicators.

Now what?

Where one researcher might screen out one in twenty responses using an attention check, another might screen out one in two responses using a time cut-off. By using the same set of indicators, there could be more consistency in data screening across studies.

Buchanan and Scofield created a tool to screen out low quality data using the five indicators described above. You can read more here or watch a brief video on how to use it.

And if you are planning an online study, try to collect about 15% more participants than you need, to account for the random clickers and automated form fillers.
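
For example, with an expected exclusion rate of roughly 14%, the recruitment target can be backed out from the analysis sample you actually need. The sketch below divides by the expected retention rate rather than simply adding 15%, a slightly more conservative version of the same advice; the numbers are illustrative.

```python
# Back-of-the-envelope oversampling: to keep ~200 usable responses when ~14%
# are expected to be screened out, recruit ceil(200 / (1 - 0.14)) = 233.
import math

def recruitment_target(needed_n: int, expected_exclusion_rate: float = 0.14) -> int:
    return math.ceil(needed_n / (1 - expected_exclusion_rate))

print(recruitment_target(200))  # 233
```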

Psychonomic article focused on in this post:

Buchanan, E. M., & Scofield, J. E. (2018). Methods to detect low quality data and its implication for psychological research. Behavior Research Methods. https://doi.org/10.3758/s13428-018-1035-6


Author

  • Anja Jamrozik is a behavioral scientist and consultant working to improve the design of our built environment. She is currently a consultant at a lab dedicated to understanding the interaction between health and well-being and indoor environments, where she tests the environment's impact on people: their cognitive function, productivity, feelings, comfort, and well-being. Anja received her B.Sc. in Psychology and Cognitive Science from McGill University and her Ph.D. in Cognitive Psychology from Northwestern University, where her research focused on higher-order cognition, including analogical reasoning, metaphoric comparison, and spatial and relational language. She later completed a postdoctoral research fellowship in Cognitive Neuroscience at the University of Pennsylvania, where her research focused on the development of abstract concepts, diversity in people’s use of spatial and relational language, and real-world consequences of aesthetic preferences.


