Should you have to be a data archeologist with open data?
Have you ever come back to a dataset a few years after collecting it, and scratched your head over what Past You was thinking and what the variables and labels mean? Now, imagine it’s not Past You, but another researcher, and you are trying to use the dataset they released along with their paper five years ago.
As of this writing, it is very common for Psychology journals to require, at a minimum, a data availability statement, and more journals are pushing authors to make their data publicly accessible rather than just saying “data available on request.” But can you actually find someone’s dataset, and how useful is it once you do? John Towse, David Ellis, and Andrea Towse asked, in their recent Behavior Research Methods paper, how often Psychologists make their data available in a publicly accessible way, and whether the datasets they could find were useful on their own.
Open means useful, right?
To get a sense of the landscape here, the authors looked across fifteen Psychology journals, examining 2,243 papers published between 2014 and 2017 to see whether they provided open, publicly accessible data and how usable that data was on its own. They found that relatively few papers (on the order of 4% across the entire sample) made data publicly available, and that only a subset of those datasets was fully usable on its own.
Digging into data from an old project of one’s own always involves a certain degree of data archeology, even with fastidious notes on every detail of the experiment and analysis, and the problem is worse when you are sifting through someone else’s dataset without being able to ask them what they did and why. You might spend your time picking the data you need out of the less useful elements in the datafiles, and even then you might not be able to make sense of the data if you don’t know what you’re looking at.
Now, to take the archeological metaphor further, it’s probably not very useful to just throw the data somewhere and leave a data midden for other researchers to dig through, although that’s pretty much what the pile of old computers in the corner of the lab is (hopefully that one from 2005 still boots). It is a lot more useful if there’s some organization, and some hope that a future researcher might be able to find what they’re looking for. Even then, as Towse and colleagues note, being able to find the data isn’t enough if you can’t make sense of it. An archeologist on a dig in Greece who finds a Linear A tablet learns something from the find itself, but since we cannot read Linear A, there is a huge amount they would miss without the ability to make sense of the inscription.
Simplifying the data dig
Having evaluated the 71 datasets they were able to access, Towse and colleagues found that many of them were not as usable as one might hope. For example, a dataset in a proprietary format, like the files used natively by SPSS, might as well be in Linear A if one does not have a license for the software to read them. So, then, how might one make it easier for a fellow researcher to use data that is posted publicly? Assuming, first, that you haven’t coded your data in a dead language that cannot be read by any living soul (not a recommended solution to the problem of sensitive data), one key recommendation of this paper is to include a data dictionary or readme file along with your dataset. Yes, this is extra work, but if it’s there, your colleagues (or even you) will know what they’re looking at, and won’t mistake your condition variable for the participant’s response, or something equally odd.
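And it doesn’t have to be much work. As a minimal sketch, here is one way you might convert an SPSS file into an open CSV and generate a bare-bones data dictionary from the metadata already embedded in the file. This is just an illustration: it assumes Python with the pandas and pyreadstat libraries installed, and the file name is hypothetical.

```python
# Minimal sketch: export a proprietary SPSS .sav file to an open CSV,
# and write out a simple data dictionary from its embedded metadata.
# Assumes pandas and pyreadstat are available; "experiment1.sav" is a
# hypothetical file name standing in for your own data file.
import pandas as pd
import pyreadstat

# Read the .sav file; pyreadstat returns the data plus its metadata
df, meta = pyreadstat.read_sav("experiment1.sav")

# Save the data itself in a plain, software-agnostic format
df.to_csv("experiment1_data.csv", index=False)

# Build a data dictionary: one row per variable, with its label and
# any value codes (e.g., 1 = "control", 2 = "treatment")
dictionary = pd.DataFrame({
    "variable": meta.column_names,
    "label": meta.column_labels,
    "value_codes": [
        "; ".join(f"{code} = {label}" for code, label in
                  meta.variable_value_labels.get(name, {}).items())
        for name in meta.column_names
    ],
})
dictionary.to_csv("experiment1_data_dictionary.csv", index=False)
```

Even a table this simple, dropped into the repository next to the data and a short readme, tells a future reader which column is which and what the codes mean, which is most of the battle.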
Why open data, and how can we make it better?
It is worth reflecting on why we might want to make data available – beyond the simple fact of a journal or a funding agency mandating it – and why we might want to heed the recommendations from this paper. Fundamentally, we are all building on each other’s work, and the ability to go and learn directly from someone else’s data or experiment will help us increase our collective knowledge faster, with fewer dead ends and uninformative experiments.
How can we make it better? We can put our data and experiments out in the world in a permanent way: not relying on journals’ linking policies for supplemental materials or on our personal laboratory websites, but using repositories that exist to provide a stable home for data, stimuli, and code. That way, we have a library with an accurate catalog, rather than a data midden. Furthermore, in putting our work in these repositories, we should think about Future Colleagues, and, probably, our future selves! A well-documented dataset in a repository is a lot more useful than a pile of poorly organized Matlab files on a dusty computer in the corner of the lab! Embracing Towse and colleagues’ recommendations will do more than help Future Colleagues avoid playing data archeologist; it is very likely to help us become better, more productive scientists in the future.
Featured Psychonomic Society article
Towse, J. N., Ellis, D. A., & Towse, A. S. (2020). Opening Pandora’s Box: Peeking inside Psychology’s data sharing practices, and seven recommendations for change. Behavior Research Methods, 1-14. https://doi.org/10.3758/s13428-020-01486-1