Should you have to be a data archeologist with open data?
Have you ever come back to a dataset a few years after collecting it, and scratched your head over what Past You was thinking and what the variables and labels mean? Now, imagine it’s not Past You, but another researcher, and you are trying to use the dataset they released along with their paper five years ago.
As of this writing, it is very common for Psychology journals to require, at a minimum, a data availability statement, and more journals are pushing authors to make their data publicly accessible rather than just saying “data available on request.” But can you actually find someone’s dataset, and how useful is it once you do? John Towse, David Ellis, and Andrea Towse asked, in their recent Behavior Research Methods paper, how often Psychologists make their data available in a publicly accessible way, and whether the datasets they could find were useful on their own.
Open means useful, right?
To get a sense of the landscape here, the authors looked across fifteen Psychology journals, examining 2,243 papers published between 2014 and 2017 to see whether they provided open, publicly accessible data and how usable that data was on its own. They found that relatively few papers (on the order of 4% across the entire sample) made data publicly available, and that only a subset of those datasets was fully usable on its own.
Digging into data from an old project of one’s own always involves a certain degree of data archeology, even with fastidious notes on every detail of the experiment and analysis, and the problem is worse when you are sifting through someone else’s dataset without being able to ask them what they did and why. You might spend your time picking the data you need out of the less useful elements in the datafiles, and even then you might not be able to make sense of the data if you don’t know what you’re looking at.
Now, to take the archeological metaphor further, it’s probably not very useful to just throw the data somewhere and leave a data midden for other researchers to dig through, although that’s pretty much what the pile of old computers in the corner of the lab is (hopefully that one from 2005 still boots). It is a lot more useful if there’s some organization, and some hope that a future researcher might be able to find what they’re looking for. Even then, as Towse and colleagues note, being able to find the data isn’t enough if you can’t make sense of it. An archeologist on a dig in Greece who finds a Linear A tablet learns something from the find itself, but since we cannot read Linear A, there is a huge amount they would miss without the ability to make sense of the inscription.
Simplifying the data dig
Having evaluated the 71 datasets they were able to access, Towse and colleagues found that many of them were not as usable as one might hope. For example, a dataset in a proprietary format, like the files used natively by SPSS, might as well be in Linear A if one does not have a license for the software to read them. So, then, how might one make it easier for a fellow researcher to use data that is posted publicly? Assuming, first, that you haven’t coded your data in a dead language that cannot be read by any living soul (not a recommended solution to the problem of sensitive data), one key recommendation of this paper is to include a data dictionary or readme file along with your dataset. Yes, this is extra work, but if it’s there, your colleagues (or even you) will know what they’re looking at, and won’t mistake your condition variable for the participant’s response, or something equally odd.
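And it doesn’t have to be much work. As a minimal sketch, here is one way you might convert an SPSS file into an open CSV and generate a bare-bones data dictionary from the metadata already embedded in the file. This is just an illustration: it assumes Python with the pandas and pyreadstat libraries installed, and the file name is hypothetical.

```python
# Minimal sketch: export a proprietary SPSS .sav file to an open CSV,
# and write out a simple data dictionary from its embedded metadata.
# Assumes pandas and pyreadstat are available; "experiment1.sav" is a
# hypothetical file name standing in for your own data file.
import pandas as pd
import pyreadstat

# Read the .sav file; pyreadstat returns the data plus its metadata
df, meta = pyreadstat.read_sav("experiment1.sav")

# Save the data itself in a plain, software-agnostic format
df.to_csv("experiment1_data.csv", index=False)

# Build a data dictionary: one row per variable, with its label and
# any value codes (e.g., 1 = "control", 2 = "treatment")
dictionary = pd.DataFrame({
    "variable": meta.column_names,
    "label": meta.column_labels,
    "value_codes": [
        "; ".join(f"{code} = {label}" for code, label in
                  meta.variable_value_labels.get(name, {}).items())
        for name in meta.column_names
    ],
})
dictionary.to_csv("experiment1_data_dictionary.csv", index=False)
```

Even a table this simple, dropped into the repository next to the data and a short readme, tells a future reader which column is which and what the codes mean, which is most of the battle.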
Why open data, and how can we make it better?
It is worth reflecting on why we might want to make data available – beyond the simple fact of a journal or a funding agency mandating it – and why we might want to heed the recommendations from this paper. Fundamentally, we are all building on each other’s work, and the ability to go and learn directly from someone else’s data or experiment will help us increase our collective knowledge faster, with fewer dead ends and uninformative experiments.
How can we make it better? We can put our data and experiments out in the world in a permanent way: not relying on journals’ linking policies for supplemental materials or on our personal laboratory websites, but using repositories that exist to provide a stable home for data, stimuli, and code. That way, we have a library with an accurate catalog, rather than a data midden. Furthermore, in putting our work in these repositories, we should think about Future Colleagues, and, probably, our future selves! A well-documented dataset in a repository is a lot more useful than a pile of poorly organized Matlab files on a dusty computer in the corner of the lab! Embracing Towse and colleagues’ recommendations will do more than help Future Colleagues avoid playing data archeologist; it is very likely to help us become better, more productive scientists in the future.
Featured Psychonomic Society article
Towse, J. N., Ellis, D. A., & Towse, A. S. (2020). Opening Pandora’s Box: Peeking inside Psychology’s data sharing practices, and seven recommendations for change. Behavior Research Methods, 1-14. https://doi.org/10.3758/s13428-020-01486-1