#PSBigData: Helping big data research become more ethical and more open

It’s easy to get excited about the promise of big data and naturally occurring datasets. Whether you were first captivated by “culturomics” nearly a decade ago or are just discovering its potential in this special issue of the Psychonomic Society’s journal Behavior Research Methods, you are not alone in seeing big data or naturally occurring datasets (or BONDS, as Tom Griffiths and I called them in 2017) as a new testbed for examining psychological theory outside of the lab and for exploring new behavioral dynamics in natural settings. The guest editors of the special issue, Gary Lupyan and Rob Goldstone, made this point in their opening post.

At the same time, the connectedness that allows us to aggregate BONDS is also improving our ability to collect large-scale experimental data through crowdsourcing and citizen science. This point was made by Todd Gureckis and Tom Griffiths in their post, and it also underlies the article by Joshua Hartshorne and colleagues in the special issue.

Under the right conditions, these data can provide rigorous tests of our existing theories, shed new light on old phenomena, and even lead to more transparent and open science.

Part of the excitement around BONDS (and large-scale approaches generally) lies in their richness. While a stereotypical view of “big data” emphasizes only the number of bytes involved, a more multidimensional definition from IBM in 2013 identifies four different ways in which a given dataset might be considered rich: volume (or the size of the data you have), velocity (or how quickly you can turn the data into actionable insights), variety (or the different kinds of data in your dataset), and veracity (or the uncertainty in your data). This definition is particularly useful because it encompasses many of the concerns that we as psychological scientists already have about data from traditional experimental perspectives, like statistical power and external validity. BONDS, while not always petabytes in size, are striking because they are large-scale, fast-moving, multi-faceted, and life-like.

However, this very richness presents psychological scientists with serious concerns for participant ethics. Even in this new age of rich data, we must continue to respect participants’ autonomy, minimize potential harms and maximize potential benefits to them, and distribute the benefits and risks of participation fairly across groups. Data like GPS coordinates, video, audio, and even social media activity afford incredible insights into real-world behaviors by tracing an individual’s behavior over time, but they also carry the potential for equally incredible risks to individual participants, as outlined by Simon Dennis and colleagues in the current special issue. This poses a fundamental problem for advocates of open science, particularly those who call for open data.

How can we meet both our fundamental responsibility to protect our participants’ rights and our responsibility for scientific openness?

There is growing public awareness of the potential risks of data-sharing. Members of the public hear regularly about countless data breaches, about hackers holding individuals’ and municipalities’ data for ransom, and even about organizations voluntarily sharing sensitive data with third parties. This has led to increasing concern among individuals about how their data are shared, yet there is still surprisingly little understanding of important mechanisms behind data-sharing, like terms-of-use agreements or the identifying or discriminatory power of personal data. This lack of understanding poses a particular problem for researchers in psychological science committed to the Belmont principle of respect for persons, given the pivotal roles of voluntariness and comprehension (National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979).

A large part of the problem of BONDS data ethics intersects with society and technology in ways that are beyond the reach of most academic researchers, especially those who re-use datasets not originally intended for academic research purposes (like social media datasets). In fact, most BONDS research that exclusively repurposes data not collected for research purposes is explicitly considered not to be human-subjects research, thereby exempting it from institutional review board (IRB) oversight. Confronting BONDS ethics in those cases will require a legal and ethical framework that can provide adequate data protections outside of human-subjects research.

However, in the current special issue, Simon Dennis and colleagues consider a very specific but particularly thorny slice of this concern: How can researchers collect new rich data from participants openly and ethically? This question is particularly timely as cognitive scientists are working to blend the richness of big data with the rigor of experimental methods, often taking advantage of the data-collection opportunities of experience sampling methods that can—for example—record brief snippets of audio at random intervals throughout the day or track a participant’s GPS coordinates over several weeks.
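To make that kind of data collection concrete, here is a minimal sketch of how random-interval experience sampling might be scheduled in Python. It is purely illustrative: the function name, its parameters, and the fixed waking-hours window are my assumptions, not anything specified in the special issue.

```python
import random
from datetime import datetime, timedelta

def schedule_sampling_times(start_date, n_days, samples_per_day,
                            waking_start_hour=9, waking_end_hour=21):
    """Draw random experience-sampling prompt times within waking hours."""
    schedule = []
    window_seconds = (waking_end_hour - waking_start_hour) * 3600
    for day in range(n_days):
        # Anchor each day's window at the (assumed) start of waking hours.
        window_start = (start_date + timedelta(days=day)).replace(
            hour=waking_start_hour, minute=0, second=0)
        for _ in range(samples_per_day):
            schedule.append(window_start +
                            timedelta(seconds=random.uniform(0, window_seconds)))
    return sorted(schedule)

# Example: five random audio-snippet prompts per day for two weeks.
times = schedule_sampling_times(datetime(2019, 7, 1), n_days=14, samples_per_day=5)
for t in times[:3]:
    print(t.strftime("%Y-%m-%d %H:%M"))
```

In a real study, a schedule like this would drive device prompts or brief background recordings, producing exactly the sort of rich, longitudinal record that raises the privacy questions discussed below.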

These projects are considered new human-subjects research activity, so they are bound by the legal principles in the Common Rule. However, the localized nature of IRBs means that different institutions can handle the same situations in dramatically different ways. These differences can be particularly pronounced when it comes to how (or even whether) researchers are allowed to share their data in the interest of open science.

Echoing similar discussions in other areas of the public sphere, Dennis and colleagues propose an interesting solution—that participants remain owners of their data and that researchers access those data in a relatively limited way. Rather than having researchers be de facto owners of the data from an experiment, participants would voluntarily share whatever degree of data they wanted to share (like their GPS data or call records), and each participant would upload their data themselves to a relatively restricted-access repository that would be keyed to them with a random identifying number. Unlike today’s data analysis model—in which researchers hold their data on their own computers and analyze them locally—this model would prevent researchers from seeing the raw data and would instead have researchers submit requests for specific analyses to the repository directly. The output from those analyses would be sent back to the researchers, along with the identification numbers of the participants who were included in the dataset. The output would ensure participant privacy by making sure that no single participant’s data could be identified from the output, and the identification numbers would facilitate openness by allowing other researchers to access the same sample as the original dataset (but without seeing the original data).
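To illustrate the flow of this proposal, here is a toy sketch of such a restricted-access repository. This is not Dennis and colleagues’ implementation: the class and method names are invented, and the minimum-sample-size rule stands in for whatever formal privacy guarantees a real system would need.

```python
import statistics
import uuid

class RestrictedRepository:
    """Toy model of a participant-owned data store: researchers never see raw
    records, only the outputs of approved aggregate analyses."""

    MIN_SAMPLE = 5  # assumed threshold below which output could expose someone

    def __init__(self):
        self._records = {}  # random participant ID -> participant-supplied data

    def participant_upload(self, data):
        """A participant uploads their own data and keeps the random key."""
        participant_id = str(uuid.uuid4())
        self._records[participant_id] = data
        return participant_id

    def run_analysis(self, field, statistic):
        """A researcher requests a named aggregate; raw data never leave."""
        included = [pid for pid, rec in self._records.items() if field in rec]
        if len(included) < self.MIN_SAMPLE:
            raise PermissionError("Sample too small to protect privacy.")
        values = [self._records[pid][field] for pid in included]
        funcs = {"mean": statistics.mean, "stdev": statistics.stdev}
        # Return the aggregate plus the participant IDs, so other researchers
        # can rerun analyses on the very same sample without seeing the data.
        return {"result": funcs[statistic](values), "participant_ids": included}

# Example: six participants each share their average daily travel distance (km).
repo = RestrictedRepository()
for km in [3.2, 7.5, 1.1, 12.4, 5.0, 8.8]:
    repo.participant_upload({"daily_travel_km": km})
print(repo.run_analysis("daily_travel_km", "mean"))
```

The key design choice is that run_analysis returns only aggregates plus the sample’s identification numbers; because those numbers index the sample rather than expose the records, a second researcher could reproduce or extend an analysis on exactly the same participants without ever seeing their raw data.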

Of course, even this proposal raises concerns of its own. For example, if participants are paid fairly for the value of their data, it would be important to show that such compensation is not coercive; otherwise, the voluntariness of that sharing could be called into question, just as the voluntariness of terms-of-service agreements is being questioned. At the same time, the true degree of openness of data access would be questionable if the cost of market-value participant payments fell on individual researchers, since only researchers with significant funding would be able to afford it.

Despite these (and other) considerations that must be addressed, this proposal takes a first, important step toward both protecting our participants’ rights and promoting open-science practices.

