What p-hacking really looks like

Masicampo and LaLande (2012; henceforth M&L) assessed the distribution of 3627 exactly calculated p-values between .01 and .10 from 12 issues of three journals. The authors concluded that “The number of p-values in the psychology literature that barely meet the criterion for statistical significance (i.e., that fall just below .05) is unusually large” and that, “Specifically, the number of p-values between .045 and .050 was higher than that predicted based on the overall distribution of p”.

Four factors determine the distribution of p-values: the number of studies examining true effects and false effects, the power of the studies that examine true effects, the frequency of Type 1 errors (and the extent to which they are inflated), and publication bias. Due to publication bias, we should expect a substantial drop in the frequency with which p-values above .05 appear in the literature. True effects yield a right-skewed p-curve (the higher the power, the steeper the curve; e.g., Sellke, Bayarri, & Berger, 2001). When the null-hypothesis is true, the p-curve is uniformly distributed, but when the Type 1 error rate is inflated due to flexibility in the data analysis, the p-curve can become left-skewed below p = .05.

M&L (and others, e.g., Leggett, Thomas, Loetscher, & Nicholls, 2013) model p-values with a single exponential curve estimation procedure that provides the best fit of p-values between .01 and .10 (see Figure 3, right pane). This is not a valid approach, because p-values above and below p = .05 do not lie on a continuous curve, due to publication bias. It is therefore neither surprising, nor indicative of a prevalence of p-values just below .05, that their single curve does not fit the data very well, nor that chi-squared tests show that the residuals (especially those just below .05) are not randomly distributed.

P-hacking does not create a peak in p-values just below .05. In fact, p-hacking does not even have to lead to a left-skewed p-curve. If you perform multiple independent tests in a study where the null-hypothesis is true, the Type 1 error rate is substantially inflated, but the p-curve remains uniform, just as if you had performed the same number of independent studies. The left skew (in addition to the overall increase in false positives) emerges through dependencies in the data in a repeated testing procedure, such as collecting data, performing a test, collecting additional data, and analysing the old and new data together.

In Figure 1, two multiple-testing scenarios (comparing a single mean with up to five other means, or collecting additional participants up to a maximum of five times) are each simulated 100,000 times when there is no true effect (for details, see the supplemental material). Only 500 significant Type 1 errors should be observed in each bin without p-hacking,
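To make the second (optional stopping) scenario concrete, the following is a minimal Python sketch of the idea, not the authors' supplemental code: the initial and added sample sizes of 20 are illustrative assumptions, and only the maximum of five additional data collections follows the description above. Under a true null-hypothesis, the proportion of significant results rises well above the nominal .05, and the significant p-values cluster just below .05 rather than being uniformly distributed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

n_sims = 10_000      # the paper reports 100,000 runs; fewer here for speed
n_start = 20         # assumed initial sample size (illustrative)
n_add = 20           # assumed batch size per additional collection (illustrative)
max_extra = 5        # collect additional participants up to five times
alpha = 0.05

final_p = np.empty(n_sims)
for i in range(n_sims):
    data = rng.normal(0.0, 1.0, n_start)        # null-hypothesis is true: population mean = 0
    p = stats.ttest_1samp(data, 0.0).pvalue     # first look at the data
    extra = 0
    while p >= alpha and extra < max_extra:     # optional stopping: add data while non-significant
        data = np.concatenate([data, rng.normal(0.0, 1.0, n_add)])
        p = stats.ttest_1samp(data, 0.0).pvalue # re-test the old and new data together
        extra += 1
    final_p[i] = p

print("Type 1 error rate:", (final_p < alpha).mean())   # substantially above .05

# Significant p-values pile up just below .05 (a left-skewed p-curve)
# instead of being uniform between 0 and .05.
sig = final_p[final_p < alpha]
counts, edges = np.histogram(sig, bins=np.linspace(0, 0.05, 11))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.3f}-{hi:.3f}: {c}")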