Best (but oft-forgotten) practices: the multiple problems of multiplicity - whether and how to correct for many statistical tests.

Testing many null hypotheses in a single study increases the probability of obtaining at least one statistically significant finding purely by chance (the problem of multiplicity). Debate has continued for many years over whether to correct for multiplicity and, if so, how. This article first discusses how multiple tests inflate the overall type I error rate (the α level), then explores the different contexts in which multiplicity arises: testing for baseline differences in various types of studies, having >1 outcome variable, conducting statistical tests that produce >1 P value, taking multiple "peeks" at the data, and performing unplanned, post hoc analyses (i.e., "data dredging," "fishing expeditions," or "P-hacking"). It then discusses some of the methods that have been proposed for correcting for multiplicity, including single-step procedures (e.g., Bonferroni); multistep procedures, such as those of Holm, Hochberg, and Šidák; false discovery rate control; and resampling approaches. These approaches address different aspects of the problem and are not necessarily mutually exclusive. For example, resampling methods could be used to control either the false discovery rate or the family-wise error rate (as defined later in this article). However, using any of these approaches presupposes that we should correct for multiplicity, which is not universally accepted, and the article presents the arguments for and against such "correction." The final section brings these threads together and offers suggestions on when it makes sense to apply the corrections and how to do so.
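
As a rough illustration of the inflation and of three of the corrections named above, the following is a minimal sketch rather than the article's own procedure. It assumes independent tests for the family-wise error rate formula 1 - (1 - α)^m, and the function names and example P values are hypothetical, chosen only for illustration.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni(pvals, alpha=0.05):
    """Single-step rule: reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down rule: compare the k-th smallest p (k = 0, 1, ...) with
    alpha / (m - k) and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(pvals, q=0.05):
    """Step-up false discovery rate rule: find the largest rank k with
    p_(k) <= q * k / m and reject the k smallest P values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Ten independent tests at alpha = 0.05 give roughly a 40% chance
    # of at least one spurious "significant" result.
    print(round(familywise_error_rate(0.05, 10), 3))

    example_pvals = [0.001, 0.012, 0.02, 0.035, 0.30]  # made-up values
    print(bonferroni(example_pvals))
    print(holm(example_pvals))
    print(benjamini_hochberg(example_pvals))
```

On these made-up P values the three rules reject one, two, and four hypotheses, respectively, which reflects the general ordering: Bonferroni is the most conservative, Holm is never less powerful than Bonferroni, and false discovery rate control targets a weaker error criterion and so typically rejects more.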
