Gorilla among MTurkers: Robust online data collection

As researchers focus more and more on the factors that support replicability and replication in cognitive psychology, they are increasingly turning toward online venues for data collection.

Many experiments are still run in the lab with participants recruited from convenience samples, because this gives researchers more control over their participants’ behavior and because they often have tried-and-tested ways of handling and preventing data loss.

That said, problems do arise for studies of small effects, which are difficult to observe reliably in small samples. Moreover, at many universities and colleges, participants are not broadly representative of the population, but rather a snapshot of a demographic that is predominantly Western, Educated, Industrialized, Rich, and Democratic (WEIRD). We have discussed the issue of deWEIRDing our samples here before. Online studies represent one way to increase the diversity of participant pools. For example, an online study can recruit participants from across the web, and it can be run in a web browser on any laptop during a community open house, in classrooms, and so on.

The last fifteen years of research with online studies have largely taken place on Amazon’s Mechanical Turk (MTurk) platform. We have discussed the quality of MTurk data here before as well. If you have ever used this service, you know that the presentation of information is user-unfriendly, messy, disorganized, and seemingly contradictory. A zoo of resources in Python, R, and JavaScript exists to make Mechanical Turk easier to use, both in the design and presentation of experiments and in the collection and analysis of data.

There are also concerns about the poor pay for these so-called “human intelligence tasks”: the typical experienced “Turker” makes about $5 an hour, well below the minimum wage. Further problems include the suspected use of bots to automatically fill in surveys, the quality of data gathered from an exploited workforce, and other issues.

Remote participants also use various types of computers, with different operating systems, different browsers, varying computing resources, and better or worse internet connection speeds. Anyone who has ever had to open a website in another browser can tell you that there is no guarantee a given website will work on your computer. These technical issues affect the quality of data, particularly when one is seeking accurately measured reaction times. Without hiring an engineer specifically for your lab’s research, it can be hard to get started.
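To make the concern concrete, here is a deliberately bare-bones sketch of how a reaction time is typically captured in the browser: a timestamp at stimulus onset, a timestamp at the keypress, and a subtraction. This is plain JavaScript for illustration only (the element id and function name are made up), not how Gorilla or any particular framework implements timing.

```javascript
// Illustrative sketch: a minimal reaction-time measurement in the browser.
// Every layer between the physical keypress and this handler (USB polling,
// the operating system, the browser's event loop, other open tabs) can add
// variable delay to the measured RT.
function runTrial(stimulusText) {
  return new Promise((resolve) => {
    const stimulus = document.getElementById('stimulus'); // assumes a <div id="stimulus"> exists on the page
    stimulus.textContent = stimulusText;
    const onset = performance.now(); // high-resolution timestamp at (approximate) stimulus onset

    function onKey(event) {
      if (event.key === 'ArrowLeft' || event.key === 'ArrowRight') {
        document.removeEventListener('keydown', onKey);
        resolve({ key: event.key, rt: performance.now() - onset }); // RT in milliseconds
      }
    }
    document.addEventListener('keydown', onKey);
  });
}
```

In practice, a framework also has to synchronize stimulus onset with the display’s refresh cycle and account for keyboard polling delays, and getting that right across browsers and machines is exactly the kind of engineering most labs cannot take on alone.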

This is where Gorilla comes in.

Gorilla is a platform for the development of online experiments.

Through a lot of careful engineering and repeated validation tests, its creators have developed a platform that addresses many of the concerns that researchers have about running online studies. The Gorilla team (Alexander Anwyl-Irvine, Jessica Massonnié, Adam Flitton, Natasha Kirkham, and Jo Evershed) recently published an article in the Psychonomic Society journal Behavior Research Methods detailing the implementation and validation of their platform.

For today’s post, I will skip over the technical details that make Gorilla a competitive platform. Instead, I will discuss the team’s extensive replication of the “conflict network” effect from the attention literature.

The Gorilla team conducted a replication of a classic attention paradigm involving the flanker task.

You can check out a demo of the task here.

The flanker task will be familiar to readers of this blog. It is a highly replicable task, often used to test how context and expectations influence how easily we can pick out a target among distracting items. It is useful for understanding how people arrive at the right decision despite interference, and it has spawned a number of creative variants for studying other aspects of attention.

In the “conflict network” variant of the task, participants see a set of “flanking” arrows that point either in the same direction as a center arrow (the “congruent” condition) or in a different direction (“incongruent”). These conditions are summarized in the figure below. Participants must decide which way the center arrow is pointing, either to the left or to the right.
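For readers who script their own studies, a single pair of trials of this sort might look roughly like the sketch below in jsPsych, one of the JavaScript frameworks mentioned later in this post. This is not how Gorilla implements the task (Gorilla uses its own task builder); the jsPsych v7 syntax, the `makeFlankerTrial` helper, and the stimulus styling are all illustrative assumptions.

```javascript
// A minimal congruent/incongruent flanker trial pair, sketched with jsPsych v7.
// Assumes the jspsych library and the html-keyboard-response plugin are loaded.
const jsPsych = initJsPsych();

const makeFlankerTrial = (arrows, condition, correctKey) => ({
  type: jsPsychHtmlKeyboardResponse,
  stimulus: `<p style="font-size:48px;">${arrows}</p>`,
  choices: ['ArrowLeft', 'ArrowRight'], // respond to the direction of the CENTER arrow
  data: { condition: condition, correct_response: correctKey },
  on_finish: (data) => {
    // compare case-insensitively, since key names may be normalized by the library
    data.correct =
      String(data.response).toLowerCase() ===
      String(data.correct_response).toLowerCase();
  },
});

const timeline = [
  makeFlankerTrial('&gt; &gt; &gt; &gt; &gt;', 'congruent', 'ArrowRight'),  // > > > > >
  makeFlankerTrial('&gt; &gt; &lt; &gt; &gt;', 'incongruent', 'ArrowLeft'), // > > < > >
];

jsPsych.run(timeline);
```

A full experiment would add fixation, randomized trial order, and many repetitions per condition; Gorilla handles these details through its builder rather than hand-written code.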

Decisions tend to be slower on incongruent trials than on congruent ones, and also somewhat less accurate. These results have been attributed to engagement of the “conflict network,” which is recruited in cases of interference.
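The size of the effect is usually summarized as a simple difference score: a participant’s congruency (or “conflict”) effect is the mean reaction time on incongruent trials minus the mean on congruent trials, with an analogous difference for accuracy. A rough sketch in plain JavaScript, with made-up trial records rather than data from the study:

```javascript
// Sketch: computing one participant's congruency ("conflict") effect from logged trials.
// The trial objects and values here are illustrative only.
const trials = [
  { condition: 'congruent',   rt: 512, correct: true },
  { condition: 'congruent',   rt: 498, correct: true },
  { condition: 'incongruent', rt: 570, correct: true },
  { condition: 'incongruent', rt: 603, correct: false },
];

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const correctRts = (cond) =>
  trials.filter((t) => t.condition === cond && t.correct).map((t) => t.rt);

// Slower responses on incongruent trials show up as a positive difference.
const conflictEffectMs = mean(correctRts('incongruent')) - mean(correctRts('congruent'));
console.log(`Congruency effect: ${conflictEffectMs.toFixed(0)} ms`);
```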

In the first set of experiments, the Gorilla team replicated this conflict network effect in three samples of children, comprising 268 students tested across six classrooms in Corsica, three classrooms in London, and a university open house. As presented below, accuracies were significantly lower, and reaction times significantly slower, in the incongruent case. Moreover, the variability across participants and observations was fairly low, suggesting good reliability across the different data-collection sites.

The real trial by fire came, however, when the group ran the study online. Using Prolific.co, a newer alternative to Mechanical Turk, to source participants, the team tested an additional 104 adults from across the United Kingdom (all native English speakers), who participated on their own machines at home. While there was considerable variability in the browsers people used and in the speed and capability of their computers, the results replicated flawlessly:

The results of the two experiments demonstrate that an online platform, with its many layers of software between a computer’s hardware (e.g., the keyboard) and data storage, can nonetheless gather data with high fidelity. Reaction times were robustly measured across many different kinds of machines outside the lab.

Altogether, the Gorilla platform shows significant progress toward better systems for cognitive science research that do not rely on antiquated architecture or complicated workarounds. As the world becomes more flexible in how we work, moving the lab to the internet can help us diversify our participant pools, but the technical challenges associated with this type of work have been non-trivial. As software in the browser gets more sophisticated, the gap between any two computers will continue to narrow.

The article also presents some valuable food for thought for psychologists getting started writing their own experiments. Given the increasing emphasis on replicability, it is important to realize that computer code and experiment standards change.

Only a few years ago, many of the tools we use to program our experiments were not even available, and those who have been in the field for decades can remind the rest of us that the pace of technological change means that a tool is not likely to keep working as hardware and software get phased out. Unified frontends like PsychoPy and Experiment Builder, or programming frameworks like PsychToolBox or jsPsych, are all fragile in the face of time.

With this in mind, platforms that are actively maintained by dedicated groups of engineers, and that pose fewer barriers to entry, are a priority for nearly all users of experimentation software. Gorilla is just one example of a growing field of better tools for better science that can be done anywhere.

Psychonomic Society article featured in this post:

Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2019). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01237-x



1 Comment

  1. “Unified frontends like PsychoPy,… PsychToolBox or jsPsych are all fragile in the face of time.”

    On what is that assessment based? PsychoPy and Psychtoolbox have certainly been around for decades. They have large, active communities of both users and developers. PsychoPy has >25k monthly users (up from 20k last year), >3,500 forum members, >100 volunteer contributors, 2.5 FTE dedicated developers (rising to 3.5 next year), and a sustainable funding model for continued full-time development.

    One *might* complain that PsychoPy’s online provision has some rough edges (it is very new), but claiming its existence is fragile seems a stretch, given its 17-year history and current growth trajectory.

    (disclaimer: I am the lead maintainer of PsychoPy)