A Tutorial on Hunting Statistical Significance by Chasing N

There is increasing concern about the replicability of studies in psychology and cognitive neuroscience. Hidden data dredging (also called p-hacking) is a major contributor to this crisis because it substantially increases the Type I error rate, resulting in a much larger proportion of false positive findings than the nominally expected 5%. To build better intuition for avoiding, detecting, and criticizing some typical problems, here I systematically illustrate the large impact on false positive rates of some easy-to-implement, and therefore perhaps frequent, data-dredging techniques. I illustrate several forms of two special cases of data dredging. First, researchers may violate the data-collection stopping rules of null hypothesis significance testing by repeatedly checking for statistical significance as the number of participants grows. Second, researchers may group participants post hoc along potential but unplanned independent grouping variables. The first approach ‘hacks’ the number of participants in a study; the second ‘hacks’ the number of variables in the analysis. I demonstrate the high rate of false positive findings generated by these techniques with data drawn from true null distributions. I also illustrate that even very mild selection and re-testing can introduce strong bias into data. Similar, usually undocumented data-dredging steps can easily yield 20–50% or more false positives.
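To make the two mechanisms concrete, below is a minimal Python sketch (assuming numpy and scipy are available) that reproduces the logic of both hacks on data drawn from a true null distribution. All simulation parameters here (starting sample size, batch size, maximum N, number of candidate groupings) are illustrative assumptions chosen for this demonstration, not settings taken from the tutorial itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def chase_n(n_start=10, n_step=5, n_max=100, alpha=0.05):
    """One simulated 'study' under a true null (two groups, no real effect):
    test after every batch of new participants, stopping as soon as p < alpha."""
    a = rng.normal(size=n_start)
    b = rng.normal(size=n_start)
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True                    # 'significant' result found by chasing N
        if len(a) >= n_max:
            return False                   # gave up at the assumed maximum N
        a = np.append(a, rng.normal(size=n_step))
        b = np.append(b, rng.normal(size=n_step))

def hack_grouping(n=40, k=5, alpha=0.05):
    """One simulated 'study': a single null outcome measure, tested across
    k unplanned binary grouping variables until one of them 'works'."""
    y = rng.normal(size=n)
    for _ in range(k):
        g = rng.integers(0, 2, size=n)     # arbitrary post hoc split
        if g.min() == g.max():             # degenerate split: skip it
            continue
        _, p = stats.ttest_ind(y[g == 0], y[g == 1])
        if p < alpha:
            return True
    return False

n_sim = 10_000
print(f"optional stopping:  {sum(chase_n() for _ in range(n_sim)) / n_sim:.1%}")
print(f"post hoc groupings: {sum(hack_grouping() for _ in range(n_sim)) / n_sim:.1%}")
```

Under these assumed settings, the optional-stopping loop typically pushes the false positive rate well above the nominal 5%, and because the five arbitrary splits are roughly independent tests of null data, the grouping hack lands near 1 − 0.95⁵ ≈ 23%, consistent with the 20–50% range cited above.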
