Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value

Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication (p-hacking).

Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking.

Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether the dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when the simulated variables are intercorrelated. The way p-curves vary according to features of the underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers.

Conclusions. The absence of a bump in the p-curve is not indicative of a lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
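The ghost-variable scenario described in the Methods can be reproduced with a short simulation. The sketch below is illustrative only and is not the authors' original R code: the function name simulate_ghost_phacking, the sample sizes, the number of dependent variables, and the correlation values are assumptions chosen for demonstration, and the simulation is run with no true group difference for simplicity. It draws several (possibly intercorrelated) dependent variables per experiment, runs a t-test on each, and "reports" only the smallest p-value when at least one test is significant; histograms of the reported p-values below .05 then approximate the p-curves discussed in the Results.

# Minimal sketch (not the authors' original script): ghost-variable p-hacking
# with either uncorrelated or intercorrelated dependent variables, simulated
# under the null hypothesis (no true group difference).
library(MASS)  # for mvrnorm()

simulate_ghost_phacking <- function(n_experiments = 5000, n_per_group = 20,
                                    n_dvs = 5, rho = 0) {
  # Compound-symmetric correlation matrix for the dependent variables
  sigma <- matrix(rho, nrow = n_dvs, ncol = n_dvs)
  diag(sigma) <- 1
  reported_p <- rep(NA_real_, n_experiments)
  for (i in seq_len(n_experiments)) {
    # Two groups drawn from the same population, so any "effect" is spurious
    grp1 <- mvrnorm(n_per_group, mu = rep(0, n_dvs), Sigma = sigma)
    grp2 <- mvrnorm(n_per_group, mu = rep(0, n_dvs), Sigma = sigma)
    # One t-test per dependent variable
    pvals <- sapply(seq_len(n_dvs),
                    function(j) t.test(grp1[, j], grp2[, j])$p.value)
    # Ghost-variable selection: report only when something is "significant",
    # and then report the smallest p-value
    if (any(pvals < .05)) reported_p[i] <- min(pvals)
  }
  reported_p[!is.na(reported_p)]
}

# p-curve of the reported (p-hacked) values: uncorrelated vs correlated DVs
p_uncorr <- simulate_ghost_phacking(rho = 0)
p_corr   <- simulate_ghost_phacking(rho = 0.8)
breaks   <- seq(0, .05, by = .005)
par(mfrow = c(1, 2))
hist(p_uncorr, breaks = breaks, main = "rho = 0",   xlab = "reported p")
hist(p_corr,   breaks = breaks, main = "rho = 0.8", xlab = "reported p")

Consistent with the Results above, the expectation is a roughly flat p-curve below .05 when rho = 0 (no bump just below .05) and extra mass just below .05 (negative skew) when the dependent variables are strongly intercorrelated.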

[1] M. Chernick, et al. The Saw-Toothed Behavior of Power Versus Sample Size and Software Solutions, 2002.

[2] M. L. Head, et al. The Extent and Consequences of P-Hacking in Science, 2015, PLoS Biology.

[3] Leif D. Nelson, et al. P-Curve: A Key to the File Drawer, 2013, Journal of Experimental Psychology: General.

[4] Andrew Gelman, et al. Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values, 2014, Biostatistics.

[5] Alex Reinhart. Statistics Done Wrong: The Woefully Complete Guide, 2015.

[6] H. Pashler, et al. Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition, 2009, Perspectives on Psychological Science.

[7] Daniël Lakens. On the challenges of drawing conclusions from p-values just below 0.05, 2015, PeerJ.

[8] A. Greenwald. Consequences of Prejudice Against the Null Hypothesis, 1975.

[9] W. K. Simmons, et al. Circular analysis in systems neuroscience: the dangers of double dipping, 2009, Nature Neuroscience.

[10] Jeffrey T. Leek, et al. An estimate of the science-wise false discovery rate and application to the top medical literature, 2014, Biostatistics.

[11] Daniël Lakens. What p-hacking really looks like, 2014.

[12] D. G. Altman. Statistics in medical journals, 1982, Statistics in Medicine.

[13] Leif D. Nelson, et al. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015), 2015, Journal of Experimental Psychology: General.

[14] Dorothy V. M. Bishop, et al. Problems in using text-mining and p-curve analysis to detect rate of p-hacking, 2015.

[15] Patrick Dattalo. Statistical Power Analysis, 2008.

[16] R. G. Newcombe, et al. Towards a reduction in publication bias, 1987, British Medical Journal.

[17] A. D. de Groot. The meaning of "significance" for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas], 2014, Acta Psychologica.

[18] Brian A. Nosek, et al. Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention, 2014, Trends in Cognitive Sciences.

[19] Dimitra Dodou, et al. A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too), 2015, PeerJ.

[20] D. G. Altman. Statistics in medical journals: developments in the 1980s, 1991, Statistics in Medicine.

[21] J. Ioannidis. Why Most Published Research Findings Are False, 2005, PLoS Medicine.

[22] John P. A. Ioannidis, et al. How to Make More Published Research True, 2014, PLoS Medicine.

[23] John P. A. Ioannidis, et al. Discussion: Why "An estimate of the science-wise false discovery rate and application to the top medical literature" is false, 2014, Biostatistics.

[24] P. Kendall, et al. The Oxford Handbook of Research Strategies for Clinical Psychology, 2013.

[25] E. Masicampo, et al. A peculiar prevalence of p values just below .05, 2012, Quarterly Journal of Experimental Psychology.

[26] C. Begg, et al. Publication bias: a problem in interpreting medical data, 1988.

[27] P. Meehl. Why Summaries of Research on Psychological Theories are Often Uninterpretable, 1990.