#PSBigData: What you say shapes what I say: Building a causal theory from wild data

HARRIS: Well, there was a failure of—of states to—to integrate—

BIDEN: —No, but—

HARRIS: —Public schools in America. I was part of the second class to integrate, Berkeley, California Public Schools almost two decades after Brown v. Board of Education.

BIDEN: Because your city council made that decision. It was a local decision.

HARRIS: So, that’s where the federal government must step in.

That was the content of a heated exchange between Presidential candidates Kamala Harris and Joe Biden during the first Democratic primary debate of 2019. If you ignore the podiums and stage lights, this exchange was like most other instances of communication: you say something, then I do, then you do, with lots of interruptions and starts and stops in between. Understanding the psychological mechanisms underlying this sort of repeated communicative interaction is a foundational question for cognitive science. How do we agree upon a meaning? How do we manage not to talk over each other? And, how does your message affect how I feel and what I say? How about 60 seconds from now?

Norman Rockwell, “The Gossips”

Yet relatively little is known about how these complex processes unfold in naturalistic contexts like that on the debate stage. This gap in the literature isn't theoretical; it's methodological: naturalistic interactions are hard to reproduce in a lab setting, and the data are messy once you have them. But there's reason to think that the phenomena we can observe at lab scale may differ in qualitative ways from those in naturalistic data.

The availability of large-scale naturalistic data has increased dramatically over the last ten years, and is a focus of the current special issue of Behavior Research Methods. These sorts of "found" datasets provide a rich source of information about complex psychological phenomena in naturalistic contexts. In the current issue alone, data from a range of sources are used to answer psychological questions, including Reddit, Yelp, Amazon, Twitter, the Social Security Administration, and my personal favorite, the National Basketball Association.

In one paper in this issue, The rippling dynamics of valenced messages in naturalistic youth chat, Seth Frey, Karsten Donnay, Dirk Helbing, Robert W. Sumner, and Maarten W. Bos leverage a dataset of 250 million online chat messages to explore the dynamics of communicative interaction. Specifically, they asked: When does the emotional valence of a message (positive or negative) affect the rate of responses to that message?

Overall, they found that negative messages (e.g., "lol you're lying!") led to about three times more chat responses than positive messages (e.g., "Ok, then great work!"). Further, because of the size of the dataset, the researchers were able to ask fine-grained questions about the timescale of this difference. The plot below shows that the increased rate of messaging after a negative message persists for over a minute, and decreases as time passes.

An important limitation of big data for answering psychological questions is that it's hard to make the kind of causal inferences necessary for building explanatory theories. Sure: Harris said something provocative, Biden responded, and then Harris responded provocatively, but how do we untangle causality from these observations? To what extent was Harris' provocative response due to feedback from Biden's statement versus a direct effect of her own initial statement? Did some earlier statement prime her response and push it over threshold?

One solution to this problem is to couple naturalistic study of phenomena with more tightly controlled experimental studies. This approach lets us both observe phenomena "in the wild" and make inferences about causality. Another promising approach to causal inference — taken by the authors of the current study — is to use an observational causal reasoning method called "matching" (here, here and here). This approach is common in many other social sciences, where observational data are the norm, but is rare in the cognitive sciences. Broadly, the idea is that you can make inferences about the underlying cause of an effect by splitting the observations into groups based on the causal variable (treatment vs. control), and then finding matching observations across the two groups that are similar on all dimensions except the causal variable. By comparing the matched groups on an outcome measure, this approach allows you to approximate the gold standard of random assignment using observational data.
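To make the matching idea concrete, here is a minimal, self-contained sketch using entirely synthetic data. Nothing here comes from the paper's actual dataset: the covariate (a sender's baseline chat rate), the treatment (whether a message is negative), and all numbers are illustrative assumptions chosen to show how a confounder inflates a naive comparison while nearest-neighbor matching recovers something closer to the true effect.

```python
# Hypothetical illustration of covariate matching; all data are synthetic.
import random

random.seed(0)

def simulate(n=500):
    """Generate (covariate, treated, outcome) triples with a built-in confound:
    talkative senders are both more likely to be 'treated' (send a negative
    message) and have higher outcomes (more replies) regardless of treatment."""
    data = []
    for _ in range(n):
        rate = random.gauss(5, 2)                                  # confounder
        treated = random.random() < 1 / (1 + 2.718 ** -(rate - 5)) # depends on rate
        outcome = 0.5 * rate + (2.0 if treated else 0.0) + random.gauss(0, 1)
        data.append((rate, treated, outcome))
    return data

def matched_effect(data):
    """For each treated observation, find the control observation with the
    closest covariate value and average the outcome differences."""
    treated = [(x, y) for x, t, y in data if t]
    control = [(x, y) for x, t, y in data if not t]
    diffs = []
    for x_t, y_t in treated:
        x_c, y_c = min(control, key=lambda c: abs(c[0] - x_t))  # nearest neighbor
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

data = simulate()
n_t = sum(1 for _, t, _ in data if t)
naive = (sum(y for _, t, y in data if t) / n_t
         - sum(y for _, t, y in data if not t) / (len(data) - n_t))
print("naive group difference:", round(naive, 2))           # inflated by the confounder
print("matched estimate:", round(matched_effect(data), 2))  # closer to the true 2.0
```

In the study itself, matching plays an analogous role: fake-sent and actually-sent messages are paired on everything observable except whether the recipient saw the message, so the remaining difference in reply rate can be attributed to social feedback rather than to properties of the sender.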

In the current study, the authors used this matching approach to understand the causal link between the source of a valenced message and the rate of messaging. The researchers exploited a feature of the chat room that censored messages that were inappropriate (e.g., insults, personally identifying information, etc.): In cases where the chat system could not confidently determine whether a message was appropriate (a surprisingly large fraction), the system "fake sent" the message, so that it appeared to the sender to have been sent but was not visible to the recipient. By comparing fake-sent versus actually sent messages using the matching method, the researchers were able to disentangle direct versus social-feedback effects on the rate of messaging. They found evidence for a social-feedback effect that was smaller but more extended in time than the direct effect.

This study provides a beautiful case study of how existing observational datasets, coupled with causal reasoning techniques, can provide insights into foundational cognitive science questions. The potential behind large-scale observational study is vast, and we're only beginning to figure out how we can leverage these data to answer the scientific questions we care about. In the age of large-scale digital data, the challenge is becoming less, "What does my phenomenon look like in the wild?" and more, "How do I build a causal theory from wild data?"

Psychonomics Article considered in this post:

Frey, S., Donnay, K., Helbing, D., Sumner, R. W., and Bos, M. W. (2019). The rippling dynamics of valenced messages in naturalistic youth chat. Behavior Research Methods. DOI: 10.3758/s13428-018-1140-6.

The Psychonomic Society (Society) is providing information in the Featured Content section of its website as a benefit and service in furtherance of the Society’s nonprofit and tax-exempt status. The Society does not exert editorial control over such materials, and any opinions expressed in the Featured Content articles are solely those of the individual authors and do not necessarily reflect the opinions or policies of the Society. The Society does not guarantee the accuracy of the content contained in the Featured Content portion of the website and specifically disclaims any and all liability for any claims or damages that result from reliance on such content by third parties.
