In recent years, researchers have started using Amazon Mechanical Turk and similar services to collect data from online participants. Two big benefits are the speed and ease of data collection: a study that might take a year to run using a participant pool at a small university can now be completed in a day, without relying on research assistant time or lab space. We have addressed issues with online testing, and with MTurk in particular, several times on this blog.
There are also other benefits to collecting data online. Traditional psychology participant samples are WEIRD (western, educated, industrialized, rich, democratic), so testing online allows researchers to get samples that may be more representative of the worldwide population. A recent post on this site addressed the problems with WEIRD participants, and the need to de-WEIRD our samples.
But there are also drawbacks. The data collected are noisier, since people vary in the state and environment they’re in as they complete the task. They might be distracted and do the task in the background while they work or hang out at home. They might click through the task randomly, trying to finish as quickly as possible. Or they might even use automated form fillers (e.g., the Google Chrome Form Filler plug-in) to select random responses without having to click on each one.
So, what’s a researcher to do?
Screening for low-quality responses
There are a variety of ways to identify and screen out low-quality survey and task responses. Methods include removing data from people who completed a task too quickly, using catch trials (items such as “select the second option for this question” or reading comprehension questions), or identifying people who frequently changed their responses or clicked the mouse more than would be expected.
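As a rough illustration (not Buchanan and Scofield's tool), two of these rules can be written in a few lines of R. The data frame, column names, and the two-minute cut-off below are placeholders made up for the sketch.

```r
# Hypothetical data frame `survey`, one row per respondent:
#   completion_sec - seconds taken to finish the task
#   catch_item     - response to "select the second option for this question"

too_fast     <- survey$completion_sec < 120   # placeholder cut-off: under two minutes
failed_catch <- survey$catch_item != 2        # did not pick the second option

# Keep only respondents who pass both checks
screened <- survey[!(too_fast | failed_catch), ]
```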
However, no one method of data screening may work reliably across studies, and different methods will screen out different numbers of respondents.
To help fellow researchers be more systematic in their data screening, Erin Buchanan and John Scofield developed a tool that puts together multiple data-screening methods. Their tool is described in a recent article in the Psychonomic Society’s journal Behavior Research Methods and is available to use in R and even online through an internet browser.
So, how did they create it?
It takes a form filler to know a form filler
First, Buchanan and Scofield developed ways to automatically detect data produced by automated form fillers. They used the popular Form Filler plug-in to fill in a 100-question survey with seven response options per question, over and over and over.
Here are the tell-tale signs they noticed. First, the click counts for survey completion were zero. The form filler didn’t “click” each answer as it filled it in. Second, the survey responses were really fast. A survey with a hundred questions was submitted within a few seconds. Finally, the survey response distribution was uniform – each of the seven question options was equally likely to be chosen.
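For illustration only, those three signatures could be checked roughly as follows in R. The object names, thresholds, and the chi-square test for uniformity are assumptions for the sketch, not the authors' implementation.

```r
# Hypothetical inputs, one element/row per respondent:
#   survey$clicks   - total click count recorded for the survey page
#   survey$page_sec - seconds spent on the page
#   responses       - a respondent-by-item matrix of answers on a 1-7 scale

no_clicks <- survey$clicks == 0      # the form filler never "clicked" an answer
very_fast <- survey$page_sec < 10    # a 100-question survey submitted in seconds

# Roughly equal use of the seven options: a per-respondent chi-square
# goodness-of-fit test against equal proportions is one crude check.
uniform_p <- apply(responses, 1, function(r) {
  chisq.test(table(factor(r, levels = 1:7)))$p.value
})
looks_uniform <- uniform_p > 0.05    # cannot reject a uniform distribution
```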
Identifying automated, low effort, and high effort responses
Armed with this knowledge, Buchanan and Scofield conducted a study to distinguish automated, low effort, and high effort human responses.
Participants were asked to fill in a survey in three ways: high effort (read and answer each question), low effort (select random answers at their own pace), and using an automated form filler. The survey included an attention check, and the researchers also measured click counts, page timing, the use of answer choices, and the distribution of answer choices as possible indicators of low data quality.
Only 3% of high effort responses failed the attention check, but most of the low effort and form filler responses did. Answer choices were also useful in identifying high effort responses: high effort respondents tended to use a smaller range of answer options.
Click counts were useful in distinguishing form filler responses from low and high effort responses: on a 15-item survey, the average click count for form filler responses was just 1.18.
Page submit times varied between the types of responses, too. Most low effort and automated responses were faster than would be expected given the reading time for the survey, while most high effort responses were not.
Using a cut-off of two or more failed indicators (low click count, short page time, more answer choices used than expected, uniform answer distribution, failed attention check), Buchanan and Scofield could identify 100% of the automated responses and 99% of the low effort responses, while excluding only 2% of high effort responses.
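Here is a sketch of that decision rule in R, assuming the logical vectors from the sketches above (no_clicks, very_fast, looks_uniform, failed_catch) plus a hypothetical answer-choice flag. It is not the authors' package, just one way the cut-off could be applied.

```r
# Hypothetical answer-choice indicator: flag respondents who used more
# distinct options than some expected maximum (here arbitrarily 5 of 7).
too_many_options <- apply(responses, 1, function(r) length(unique(r))) > 5

# Combine the five indicators; each column is TRUE when that indicator is failed.
flags <- cbind(low_click_count = no_clicks,
               short_page_time = very_fast,
               many_options    = too_many_options,
               uniform_answers = looks_uniform,
               failed_check    = failed_catch)

n_failed    <- rowSums(flags)   # number of indicators failed per respondent
low_quality <- n_failed >= 2    # the two-or-more cut-off
cleaned     <- survey[!low_quality, ]
```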
Moving to the field
The researchers then administered the same survey to over 1,000 Amazon Mechanical Turk participants to see how they would fare on the five indicators of low data quality.
Here is what they found:
- Around 2% of responses had low click counts
- Over half of the responses were faster than would be expected based on reading time
- Around 20% of responses used more answer options than would be expected
- Around 2% of responses had a uniform distribution of answers
- Around 5% of responses failed the attention check
In total, 14% of the responses were identified as low quality using the cut-off of two or more failed indicators.
Now what?
Where one researcher might screen out one in twenty responses using an attention check, another might screen out one in two responses using a time cut-off. By using the same set of indicators, there could be more consistency in data screening across studies.
Buchanan and Scofield created a tool that screens out low quality data using the five indicators described above; you can read more in their article or watch their brief video on how to use it.
And if you are planning an online study, try to collect about 15% more participants than your target sample size to account for the random clickers and automated form fillers.
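As a quick back-of-the-envelope check (simple arithmetic with placeholder numbers, applying the rule of thumb above):

```r
# Rough recruitment target, following the ~15% rule of thumb
target_usable <- 200                                     # usable responses you want
oversample    <- 0.15                                     # expected proportion screened out

recruit_n <- ceiling(target_usable * (1 + oversample))   # 230 participants
```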
Psychonomic article focused on in this post:
Buchanan, E. M., & Scofield, J. E. (2018). Methods to detect low quality data and its implication for psychological research. Behavior Research Methods. DOI: 10.3758/s13428-018-1035-6.