#PSBigData: The Guest Editors’ agenda

(This post was co-authored with Rob Goldstone). Like many other scientific disciplines, psychological science has felt the impact of the big data revolution. This impact arises from the meeting of three forces: data availability, data heterogeneity, and data analyzability.

Availability. Consider that for decades, researchers have relied on the Brown Corpus of about 1 million words, published by Kučera and Francis in 1969. Modern resources are larger by 6 orders of magnitude (e.g., Google’s 1T corpus) and are available in a growing number of languages. About 240 billion photos have been uploaded to Facebook, and Instagram receives over 100 million new photos each day. The large-scale digitization of these data has made it possible in principle to analyze and aggregate these resources on a previously unimagined scale.

Heterogeneity refers to the availability of different types of data. For example, recent progress in automatic image recognition is owed not just to improvements in algorithms and hardware, but arguably even more to the ability to merge large collections of images with linguistic labels (produced by crowd-sourced human taggers), which serve as training data for the algorithms. Making use of heterogeneous data sources often depends on standardization. For example, the ability to combine demographic and grammatical data about thousands of languages led to the finding that languages spoken by more people have simpler morphology. Combining these data types would have been substantially more difficult without standardized language and country codes that could be used to merge the different data sources.
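To make the role of standardized codes concrete, here is a minimal sketch of such a merge in Python. It assumes two independently collected tables keyed by a shared three-letter language code. The table names, the helper `merge_on_code`, and every value are invented placeholders for illustration, not real measurements and not the method used in the study mentioned above.

```python
# Toy illustration: join two independently collected tables on a shared
# language code. All numbers are invented placeholders, not real data.

# Hypothetical demographic source: code -> speaker count (placeholder units)
demographics = {
    "eng": 1000,
    "deu": 500,
    "xyz": 10,   # present only in this source
}

# Hypothetical grammatical source: code -> morphological complexity score
grammar = {
    "eng": 0.2,
    "deu": 0.5,
    "abc": 0.9,  # present only in this source
}


def merge_on_code(left, right):
    """Inner join: keep only the codes that appear in both sources."""
    shared = sorted(left.keys() & right.keys())
    return {code: (left[code], right[code]) for code in shared}


merged = merge_on_code(demographics, grammar)
for code, (speakers, complexity) in merged.items():
    print(code, speakers, complexity)
```

Because both sources use the same code for the same language, the join is a simple key lookup. Without a shared standard, each pair of records would have to be matched by name, with all the spelling variants and ambiguities that entails.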

Analyzability. Without appropriate tools to process and analyze all of these different types of data, the “data” are mere bytes.

We had three main goals in assembling this special issue of Behavior Research Methods that forms the basis of this digital event. First, to highlight work that makes new types of data available or more accessible. Second, to highlight work that demonstrates creative merging of different types of data. Third, to report research that describes new techniques that make it possible to draw useful inferences from the data, with a focus on advancing psychological theory.

The call for contributions for this volume was broad, mentioning mining data from online databases and other “naturally occurring datasets”, construction of new linguistic corpora, methods for analyzing diverse data sources, and the creation of environments for gamifying data collection (by making experiments fun, participant enrollment can be increased at no additional cost). We deliberately omitted any restrictions on the number of participants, stimuli, or observations, lest the concept of “big data” be reduced to “large files”. We favored contributions that emphasized the impact of the new data sources and techniques on advancing psychological theory, rather than treating them as ends in themselves. Lastly, we tried to maximize the usefulness of any new datasets and analytic techniques by emphasizing the importance of open data, open experimental materials, and open code for analysis.

We grouped the contributions to this special issue into three broad clusters: uses of naturalistic or crowdsourced data, methodological advances (with an emphasis on improving data analyzability), and creation of new data resources. We were pleased that many contributions could be placed in more than one cluster.

Taken together, the contributions to this special issue of Behavior Research Methods provide an update and a surprising extension to theoretical developments in ecological psychology. Ecological psychology as originally promulgated by Neisser and Gibson was premised on the need for psychology to study behavior in real-world, not only laboratory, environments. The contemporary extension of this call to study behavior in the real world, as exemplified by the articles in this issue, is to note that a large part of a modern person’s real world consists of language, people, and technological innovations. To get a complete understanding of human behavior, we need to understand human-environment interactions, where the environment crucially includes our cell phones, instant messages, online communities, email, movies, online games, sporting contests, and computers.

As the current articles attest, all of these environmental components can provide a cornucopia of data when creatively and diligently investigated. The tools and analyses developed in these articles offer diverse perspectives impossible to achieve in the laboratory: diverse measures taken over diverse conditions in diverse contexts from a diverse sample of participants. These multiple diversities will allow researchers of the future to much more effectively study representative samples of behavior. These samples will generalize to real-world behavior much more robustly than our traditional laboratory methods can, in no small part because they are taken from real-world behavior. As stated earlier, we are not interested in equating “Big Data” with file sizes exceeding a particular threshold, but we do find the endeavor of “embiggening” behavioral science to include the kinds of diverse contexts, behaviors, tasks, and purposes found in the real world to be exciting and worthwhile.

The posts that form part of this digital event will explore those diverse issues in greater detail.
