Are the times a-changin’? Reporting before and after the 2012 statistical guidelines

“Progress is not possible without deviation.” — Frank Zappa

The ways by which psychological science deals with methodological problems are many. There are bottom-up approaches such as peer-reviewed papers and workshops given by methodologists who advocate particular types of changes. There are also top-down approaches, such as funding agencies requiring data sharing or journals requiring statements about the choice of sample size. Scientific societies and journals also publish statistical guidelines that are not meant to be hard requirements, but are rather meant to help researchers along by outlining good practice.

For instance, in 1999 the APA’s Task Force on Statistical Inference released Statistical Methods in Psychology Journals: Guidelines and Explanations, which has served as an overview of good practice for nearly two decades. In 2012, the Psychonomic Society released its statistical guidelines, which cover much of the same ground but also address more recent developments, such as Bayesian statistics. In 2016, the American Statistical Association released its statement on the use of p values for statistical inference, along with an invaluable set of commentaries that put the statement in perspective.

One question that naturally arises from such guidelines is: do they have any effect on what is published in the Society’s journals? This is a difficult question to answer, of course, because causation cannot be established from observational comparisons. Still, several similar journals published by different societies often serve the same field, which suggests that one can compare results published in the Society’s journals, which are covered by the guidelines, to results published in other journals that are not. In addition, we can look at reporting just before the guidelines were published to see how it differs from reporting just after.

In a paper recently published in Psychonomic Bulletin & Review, Peter Morris and Catherine Fritz compare statistical reporting in Psychonomic Society journals (including Psychonomic Bulletin & Review; Memory & Cognition; Attention, Perception & Psychophysics; Cognitive, Affective, & Behavioral Neuroscience; and Learning & Behavior) with reporting in the Quarterly Journal of Experimental Psychology (QJEP), published for the Experimental Psychology Society. The Psychonomic Society and the Experimental Psychology Society have similar remits, so QJEP serves as a good baseline for comparison. In both sets of journals, they compare reporting in papers published in 2013, which were accepted before the guidelines were published, to reporting in papers published in 2015.

Morris and Fritz focus on several targets of the 2012 guidelines:

Power: Is statistical power mentioned, and is an a priori power analysis provided?
Uncertainty: Are standard deviations, standard errors, and confidence intervals provided, either graphically or in the text?
Effect sizes: Are standardized effect sizes reported (e.g., η², d, R²)?
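
To make these three targets concrete, here is a minimal Python sketch of what each one involves for a simple two-group comparison. It is an illustration only, not the procedure Morris and Fritz used to code papers: the function names (`n_per_group`, `summarize`) and the example scores are hypothetical, and both the sample-size formula and the confidence interval rely on a normal approximation, so a t-based calculation would give slightly larger values in small samples.

```python
import math
from statistics import NormalDist, mean, stdev


def n_per_group(d, alpha=0.05, power=0.80):
    """A priori power analysis (normal approximation).

    Returns the approximate number of participants per group needed to
    detect a standardized mean difference `d` with a two-sided test.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def summarize(a, b, level=0.95):
    """Mean difference with its standard error, CI, and Cohen's d."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)
    diff = mean(a) - mean(b)
    # Standard error of the difference between two independent means.
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    # Normal quantile: a large-sample approximation to the t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Pooled standard deviation, used to standardize the difference.
    pooled_sd = math.sqrt(
        ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    )
    return {
        "difference": diff,
        "standard_error": se,
        "ci": (diff - z * se, diff + z * se),
        "cohens_d": diff / pooled_sd,
    }


# Hypothetical scores, invented purely for illustration.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]

print(n_per_group(d=0.5))
print(summarize(group_a, group_b))
```

Reporting all of these quantities together lets a reader judge both the size of an effect and how precisely it was estimated.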

So how’d we do? Did the guidelines appear to have an effect? The results are mixed.

The figure below shows the proportion of papers in 2013 and 2015 reporting an a priori power analysis in Psychonomic Society (PS) journals and QJEP (this figure is not in the paper; it is cobbled together from the online supplementary material). On the left, PS journals as an aggregate are compared to QJEP. The percentage of papers reporting a power analysis doubled in PS journals from 2013 to 2015, but it actually decreased in QJEP. Although a similar proportion of papers in the two sets of journals reported a power analysis in 2013, by 2015 the proportion was five times greater in PS journals.

However, this improvement was not uniform across all the PS journals. The right side of the above figure shows that large increases are evident in Cognitive, Affective, & Behavioral Neuroscience, Memory & Cognition, and Psychonomic Bulletin & Review, with much smaller gains (or none) in Learning & Behavior and Attention, Perception & Psychophysics. It must be said, however, that even in the journal with the highest proportion of articles with power analyses, Memory & Cognition, only a quarter of articles included one.

Moving to indicators of uncertainty, Morris and Fritz also found overall moderate levels of reporting of standard errors and low levels for confidence intervals. Both these numbers appeared to tick up slightly after the PS statistical guidelines were published.

In the figure below we can see that a large proportion of papers reported standard errors (either graphically or in text), though the practice is far from universal: overall, just over half of papers reported them. Interestingly, QJEP showed a large increase of 16 percentage points in reporting of standard errors between 2013 and 2015; given that QJEP released no statistical guidelines in that span, we should be cautious about interpreting any effects here as specifically caused by the PS guidelines. The PS journals notched a more modest increase of 7 percentage points overall, though again the increase was uneven across the PS journals.
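
The comparison logic in this paragraph can be written down as a small calculation. The Python sketch below uses only the two changes stated above (16 and 7 percentage points); the function names are hypothetical, and the 10%-to-20% example is invented to show how a change in percentage points differs from a fold change. Subtracting the comparison journal's change from the PS journals' change is a rough descriptive device here, not a causal estimate.

```python
def change_in_points(before_pct, after_pct):
    """Absolute change, in percentage points."""
    return after_pct - before_pct


def fold_change(before_pct, after_pct):
    """Relative change, e.g. 2.0 means the percentage doubled."""
    return after_pct / before_pct


def difference_in_differences(covered_change, comparison_change):
    """Change in the covered journals beyond the change in the comparison journal."""
    return covered_change - comparison_change


# Changes in standard-error reporting, 2013 to 2015, as stated in the text.
ps_change = 7.0     # percentage points, PS journals overall
qjep_change = 16.0  # percentage points, QJEP

print(difference_in_differences(ps_change, qjep_change))

# Invented example: a rise from 10% to 20% of papers.
print(change_in_points(10.0, 20.0), fold_change(10.0, 20.0))
```

A negative result means the comparison journal improved more than the journals covered by the guidelines, which is why the increase in PS journals alone cannot be credited to the guidelines.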

Morris and Fritz also report that the vast majority of papers including graphs showed some sort of error bars — often standard errors — but that in about a quarter of cases overall, the error bars were not identified.

In contrast to standard errors, the overall reporting of confidence intervals is very low. The figure below shows the percentage of papers reporting confidence intervals, either graphically or in the text. From 2013 to 2015, reporting of confidence intervals in both PS journals and QJEP increased slightly, but the increase was moderately larger for the PS journals. If we look at the five PS journals individually, we again find larger gains in Cognitive, Affective, & Behavioral Neuroscience, Memory & Cognition, and Psychonomic Bulletin & Review, and smaller gains in Learning & Behavior and Attention, Perception & Psychophysics.

Morris and Fritz report that, overall, substantially more than half of articles reported some sort of standardized effect size, though in only two cases was this effect size reported with a confidence interval. As shown in the figure below, reporting of effect sizes increased in both PS journals and QJEP between 2013 and 2015, making it again difficult to argue for any specific effect of the statistical guidelines on PS journals.

Interestingly, the journal with the largest increase in reporting effect size was Learning & Behavior, which showed a staggering 28 percentage point increase. Given that Learning & Behavior showed no increases in reporting of power analyses, standard errors, or confidence intervals, one wonders whether this increase is simply random variability in the authors submitting to the journal, or whether this indicates something about the kind of research published in the journal.
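
One rough way to gauge whether a jump of this size could be sampling variability is an interval for the difference between two proportions. The Python sketch below uses a simple Wald interval; the counts are hypothetical (this post does not give the number of Learning & Behavior papers examined) and were chosen only so that the difference comes out at 28 percentage points.

```python
import math
from statistics import NormalDist


def diff_in_proportions_ci(k1, n1, k2, n2, level=0.95):
    """Wald confidence interval for p2 - p1 (large-sample approximation)."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p2 - p1
    return diff, (diff - z * se, diff + z * se)


# Hypothetical counts: 8 of 25 papers in 2013, 15 of 25 papers in 2015.
diff, (low, high) = diff_in_proportions_ci(8, 25, 15, 25)
print(f"difference = {diff:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With counts of this hypothetical size the interval is wide, so even a large observed jump is compatible with a much smaller underlying change. The Wald interval is itself crude for small samples, so this is a plausibility check rather than a formal test.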

Overall, Morris and Fritz argue that the guidelines appear to have had an effect on the reporting in PS journals, but that effect is admittedly small at best. Guidelines likely have only a small, temporary effect compared to reviewers and editors exerting direct pressure on authors, and in my experience, reviewers seldom address these sorts of detailed issues. The lack of comprehensive reporting is especially concerning given that none of the journals considered requires data sharing by default. The reporting in the paper might be all the world ever sees; if it is incomplete, the future impact of the work is in doubt. Hopefully, Morris and Fritz’s summary will spur editors to do more to encourage authors to take the time to report their results more completely.

Article focused on in this post:

Morris, P. E., & Fritz, C. O. (2017). Meeting the challenge of the Psychonomic Society’s 2012 Guidelines on Statistical Issues: Some success and some room for improvement. Psychonomic Bulletin & Review. doi:10.3758/s13423-017-1267-y
