High quality MTurk data

In 2005 Amazon launched the Mechanical Turk platform (MTurk), a marketplace where requesters can pay “workers” from all over the world to complete tasks over the internet. MTurk is used to crowdsource many tasks that are still best completed by aggregate human intelligence (as opposed to machine intelligence), such as rating the relevance of search results, flagging inappropriate photos, or transcribing text from blurry images.

MTurk also incidentally provides an additional way for psychology researchers to gather data from human participants. Not all psychology studies can be conducted over the internet (e.g., those using eye-tracking or requiring very precise timing), but for those that can, MTurk is an extraordinarily powerful resource for our science.

We can now run hundreds of participants in a matter of days (or even hours!), at lower cost and without requiring any physical lab space or the subject pool of a large university. This changes the workflow of research. There’s no more waiting around a semester or two for data to trickle in. If you’ve got the programming skills, IRB approval, and some funds, you could get the data to answer your research question by the end of the week. This creates the potential for more rapid iteration of follow-up studies, and/or more time for tasks like data analysis and writing.

MTurk also expands the kinds of research questions we can feasibly go after. Chasing down a small effect size that requires more power? Need to norm a ton of new stimuli? Got sixteen between-subjects conditions? No problem!

Of course, the new possibilities come with new concerns. First and foremost: are the participants actually doing your task? When you run participants in a physical lab, you can at least glance at them and see if they’re asleep and/or playing with their phone instead of doing your research task. When you run participants using MTurk, you have no such luxury; plus, participants are often working from the uncontrolled environment of their own homes. Some of them could be distracted or goofing off, polluting your data with noise!

This issue is addressed in a new article by Peer, Vosgerau, and Acquisti in the Psychonomic Society’s journal Behavior Research Methods. In two experiments, Peer et al. explored two ways to approach the potential problem of inattentive MTurk participants: the use of “attention check questions” (ACQs) to both deter and detect inattention, and simply restricting participation to workers who have gained high reputations for their previous completion of tasks on MTurk.

Attention check questions

Most of the ACQs used in the two experiments by Peer and colleagues were questions that seemed ordinary, such as “Which sports do you like?” or “How many people do you see in this picture?” But such questions were preceded by a lengthy block of text that contained instructions to disregard the question and instead give one specific response (e.g., “just click ‘Next’ instead of clicking the sports you like”, or “type 7 even though there are 6 people in the picture”). Thus, a participant not paying close attention may answer the ACQs “incorrectly” and we could use such a failure as an a priori reason to exclude all of that participant’s data from analysis.

The use of such ACQs is pretty routine in survey research, and they have become increasingly sophisticated over time.
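
The exclusion rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure Peer and colleagues used: the field names (`sports`, `people_count`), the participant records, and the strict "fail any ACQ and you are out" threshold are all assumptions made for the example.

```python
# Hypothetical answer key: the "correct" response is the one the hidden
# instructions asked for, not the honest answer to the visible question.
EXPECTED = {
    "sports": [],         # instructed to click Next without selecting anything
    "people_count": "7",  # instructed to type 7 even though 6 people are shown
}

def passed_all_acqs(acq_responses, expected=EXPECTED):
    """Return True only if every ACQ was answered as instructed."""
    return all(acq_responses.get(q) == answer for q, answer in expected.items())

# Made-up participant records for illustration.
participants = [
    {"id": "P1", "acq": {"sports": [], "people_count": "7"}},
    {"id": "P2", "acq": {"sports": ["tennis"], "people_count": "6"}},
]

# Decide on the exclusion rule before looking at the data, then apply it.
kept = [p for p in participants if passed_all_acqs(p["acq"])]
print([p["id"] for p in kept])
```

A stricter or looser rule (for example, allowing one failure) would change only the body of `passed_all_acqs`.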

Reputation

When an MTurk worker completes an assigned task (called a HIT, for “human intelligence task”), the task’s requester can approve or reject the work. Work is generally only rejected if the worker did not follow instructions, for example if they only typed in two labels for a photo when the task instructions said to type at least three. The proportion of a worker’s previously completed tasks that were approved is that worker’s reputation. For example, if I have completed 100 HITs and only four were rejected, my reputation would be 96%. Furthermore, MTurk allows requesters to set a minimum reputation necessary for workers to participate in their tasks.
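
As a quick worked version of the reputation arithmetic above, here is a minimal sketch. The function name `approval_rate` is made up for this post, and it simply assumes reputation is approved HITs divided by all approved-or-rejected HITs.

```python
def approval_rate(approved: int, rejected: int) -> float:
    """Reputation as a percentage: approved HITs over all reviewed HITs."""
    total = approved + rejected
    if total == 0:
        raise ValueError("worker has no reviewed HITs yet")
    return 100.0 * approved / total

# The example from the text: 100 completed HITs, four of them rejected.
rate = approval_rate(approved=96, rejected=4)
print(f"{rate:.0f}%")
```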

The studies by Peer and colleagues

In Experiment 1, the main tasks that participants completed were several well established short personality survey scales, and one classic question used to elicit the anchoring and adjustment heuristic (i.e., the fact that answering a hypothetical question about a clearly arbitrary anchor influences subsequent unrelated number estimates).

What should “high quality” data look like from these tasks? For participants who are paying attention and diligently completing the tasks, we would expect high internal reliability on the survey scales (measured as Cronbach’s alpha) and a positive non-zero effect size on the anchoring question (measured as Pearson’s r).
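
For readers who want to see what these two quality measures involve, here is a rough sketch using only the Python standard library. The data are invented, and the anchoring measure is assumed here to be the correlation between each participant's anchor value and their later estimate; the article itself may compute it differently.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a scale.

    rows: one list of item scores per participant (all the same length).
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented responses to a three-item scale from four participants.
scale = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1]]
# Invented anchors and the estimates that followed them.
anchors = [12, 47, 63, 88]
estimates = [20, 35, 40, 55]

print(round(cronbach_alpha(scale), 2), round(pearson_r(anchors, estimates), 2))
```

Attentive participants should answer the items of a scale consistently (alpha near 1), while random clicking pushes alpha toward 0 and washes out the anchoring correlation.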

Peer et al. recruited separate groups of MTurk participants with high reputations (over 95%) and low reputations (under 95%). In both groups, two thirds of participants completed several ACQs along with the main tasks, whereas the other third received no ACQs. The idea behind this design was to see whether we can get better quality data from high- versus low-reputation workers, and whether we can get better quality data by screening out participants who “fail” one or more of the ACQs.

Here are the results: High-reputation workers showed high survey scale reliability and the expected anchoring effect, regardless of whether they received ACQs. Moreover, almost all of them (97%) passed the ACQs. So, for high-reputation workers, the use of ACQs seems to have no costs and no benefits.

Low-reputation workers, by contrast, only showed comparable survey scale reliability and the expected anchoring effect if they received ACQs and passed them. Only 66% of the low-reputation workers passed the ACQs.

So it appears that workers with lower reputations are overall less likely to pay close attention to tasks, and less likely to yield high quality data. We could use ACQs to weed out the least attentive of the low-reputation workers… but why bother? The best thing to do appears to be simply to restrict participation to high-reputation workers (95% or higher) and forgo ACQs altogether.

But what about the total number of prior HITs that a high-reputation worker has completed? Remember that many HITs on MTurk are simple tasks that can be completed in seconds. So workers who have successfully completed 95 out of 100 HITs may not be quite as dependable as workers who have successfully completed 950 out of 1,000. As with reputation, MTurk allows requesters to set a minimum number of prior approved HITs necessary for workers to participate in their tasks.

In Experiment 2, Peer and colleagues recruited high-reputation workers who had successfully completed under 100 HITs, and high-reputation workers who had successfully completed over 500 HITs. Note that all these workers had reputations of at least 95%.

Results showed that those who had successfully completed over 500 HITs did indeed show higher quality data in terms of survey scale reliability.

Bottom line: To ensure high quality data, restrict participation to MTurk workers with at least 95% reputation and at least 500 approved HITs.
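
If you post HITs programmatically, this recommendation can be expressed as two qualification requirements. The sketch below only builds the data structure that, to my understanding, the MTurk API (for example boto3's `create_hit`) accepts in its `QualificationRequirements` parameter. The two system qualification IDs are the ones I believe MTurk documents for approval rate and number of approved HITs; treat them as assumptions and verify them against the current MTurk documentation before use.

```python
# Assumed MTurk system qualification type IDs (verify before use).
PERCENT_APPROVED_ID = "000000000000000000L0"  # approval rate, in percent
NUM_APPROVED_ID = "00000000000000000040"      # number of approved HITs

def quality_requirements(min_percent=95, min_hits=500):
    """Build qualification requirements: reputation and approved-HIT floors."""
    return [
        {
            "QualificationTypeId": PERCENT_APPROVED_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_percent],
        },
        {
            "QualificationTypeId": NUM_APPROVED_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_hits],
        },
    ]

requirements = quality_requirements()
# These would be passed along with the other HIT settings, e.g.
# client.create_hit(..., QualificationRequirements=requirements)
print(requirements)
```

The same two thresholds can also be set by hand in the requester web interface when creating a HIT.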

Peer and colleagues note that the utility of workers’ reputations (and thus the validity of these recommendations) may not persist indefinitely into the future. The nature of the people who make up the MTurk worker population may shift over time, as may the number and nature of HITs made available to workers. It is also worth noting that the tasks used in this study were short and mostly survey questions. The whole procedure appears to have taken no more than about 10 minutes. For longer psychological tasks that require more sustained attention and diligence, it remains to be seen whether using attention-check questions may be seriously beneficial even when recruiting only high-reputation workers.

Overall, what I really like about this kind of research is that it begins to enable psychology researchers to make principled decisions about the sampling procedures we use on the common platform provided by MTurk.

(Additional discussion of MTurk’s use in psychology research can be found here and here.)