Masicampo and LaLande (2012; henceforth M&L) assessed the distribution of 3627 exactly calculated p-values between .01 and .10 from 12 issues of three journals. The authors concluded that “The number of p-values in the psychology literature that barely meet the criterion for statistical significance (i.e., that fall just below .05) is unusually large”. “Specifically, the number of p-values between .045 and .050 was higher than that predicted based on the overall distribution of p.”

Four factors determine the distribution of p-values: the number of studies examining true effects and false effects, the power of the studies examining true effects, the frequency of Type 1 errors (and the extent to which the Type 1 error rate was inflated), and publication bias. Due to publication bias, we should expect a substantial drop in the frequency with which p-values above .05 appear in the literature. True effects yield a right-skewed p-curve (the higher the power, the steeper the curve; e.g., Sellke, Bayarri, & Berger, 2001). When the null hypothesis is true, the p-curve is uniformly distributed, but when the Type 1 error rate is inflated due to flexibility in the data analysis, the p-curve can become left-skewed below p = .05.

M&L (and others, e.g., Leggett, Thomas, Loetscher, & Nicholls, 2013) model p-values with a single exponential curve estimated to best fit the p-values between .01 and .10 (see Figure 3, right pane). This is not a valid approach, because p-values above and below p = .05 do not lie on one continuous curve, due to publication bias. It is therefore neither surprising, nor indicative of a prevalence of p-values just below .05, that their single curve does not fit the data very well, nor that chi-squared tests show that the residuals (especially those just below .05) are not randomly distributed.

P-hacking does not create a peak in p-values just below .05. In fact, p-hacking does not even have to lead to a left-skewed p-curve. If you perform five independent tests in a study where the null hypothesis is true, the Type 1 error rate increases substantially, but the p-curve remains uniform, exactly as if you had performed five independent studies. The left skew (in addition to the overall increase in false positives) emerges through dependencies in the data in a repeated testing procedure, such as collecting data, performing a test, collecting additional data, and analysing the old and new data together. In Figure 1, two multiple-testing scenarios (comparing a single mean with up to five other means, or collecting additional participants up to a maximum of five times) are simulated 100,000 times when there is no true effect (for details, see the supplemental material). Without p-hacking, only 500 significant Type 1 errors (i.e., .5% of the 100,000 simulations) should be observed in each bin.
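Both Figure 1 scenarios are straightforward to reproduce. The sketch below is not the supplemental code, but a minimal Python illustration under assumptions made here: 20 participants per group, standard-normal data, two-sided t-tests at alpha = .05, fully independent comparison samples in the first scenario, and 10,000 rather than 100,000 iterations to keep the runtime short.

# Minimal sketch (not the paper's supplemental code) of the two Figure 1
# scenarios. Assumptions made here: n = 20 per group, standard-normal data
# (no true effect), two-sided t-tests, alpha = .05, and 10,000 iterations
# instead of the paper's 100,000.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
n_sims, n, alpha = 10_000, 20, .05

# Scenario 1: up to five independent two-group comparisons, stopping at the
# first significant result. Each p-value is uniform under the null, so the
# significant p-values stay uniform below .05; only their overall frequency
# inflates (to roughly 1 - .95**5, about 23% of studies).
sig_independent = []
for _ in range(n_sims):
    for _ in range(5):
        p = stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        if p < alpha:
            break
    if p < alpha:
        sig_independent.append(p)

# Scenario 2: optional stopping. Test once; while p >= .05, collect n extra
# participants per group (up to five times) and re-test the pooled data.
# Successive tests are dependent, which skews the significant p-values
# towards the .05 boundary (a left-skewed p-curve below .05).
sig_stopping = []
for _ in range(n_sims):
    a, b = rng.normal(size=n), rng.normal(size=n)
    p = stats.ttest_ind(a, b).pvalue
    for _ in range(5):
        if p < alpha:
            break
        a = np.concatenate([a, rng.normal(size=n)])
        b = np.concatenate([b, rng.normal(size=n)])
        p = stats.ttest_ind(a, b).pvalue
    if p < alpha:
        sig_stopping.append(p)

# Ten .005-wide bins below .05. Without p-hacking each bin should hold about
# n_sims * .005 = 50 Type 1 errors (500 in the paper's 100,000 runs).
bins = np.linspace(0, .05, 11)
print("independent tests:", np.histogram(sig_independent, bins)[0])  # flat
print("optional stopping:", np.histogram(sig_stopping, bins)[0])     # rising

In the first scenario the significant p-values remain uniformly distributed between 0 and .05, so only the frequency of false positives inflates; in the second, the dependence between successive tests piles significant p-values up just below .05, reproducing the left skew described above.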
References
[1] Anton Kühberger, et al. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE.
[2] Michael E. R. Nicholls, et al. (2013). The life of p: “Just significant” results are on the rise. Quarterly Journal of Experimental Psychology.
[3] M. J. Bayarri, et al. (2001). Calibration of p values for testing precise null hypotheses.
[4] E. Masicampo, et al. (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology.
[5] Brian A. Nosek, et al. (2014). Registered Reports: A method to increase the credibility of published results.
[6] Leif D. Nelson, et al. (2013). P-curve: A key to the file drawer. Journal of Experimental Psychology: General.