The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index.

OBJECTIVES: A P-value <0.05 is one metric used to evaluate the results of a randomized controlled trial (RCT). We asked how often statistically significant results in RCTs would be lost with small changes in the numbers of outcomes.

STUDY DESIGN AND SETTING: We reviewed RCTs in high-impact medical journals that reported a statistically significant result for at least one dichotomous or time-to-event outcome in the abstract. In the group with the smallest number of events, we changed the status of patients from no event to event, one patient at a time, until the P-value exceeded 0.05. We labeled the number of patients changed the Fragility Index; smaller numbers indicated a more fragile result.

RESULTS: The 399 eligible trials had a median sample size of 682 patients (range: 15-112,604) and a median of 112 events (range: 8-5,142); 53% reported a P-value <0.01. The median Fragility Index was 8 (range: 0-109); 25% had a Fragility Index of 3 or less. In 53% of trials, the Fragility Index was less than the number of patients lost to follow-up.

CONCLUSION: The statistically significant results of many RCTs hinge on small numbers of events. The Fragility Index complements the P-value and helps identify less robust results.
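The procedure described under STUDY DESIGN AND SETTING can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes a two-sided Fisher exact test on the 2x2 table of events and non-events (the abstract does not name the test), and the function names `fisher_exact_p` and `fragility_index` and the trial counts in the usage note are hypothetical.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms; columns are (events, non-events).
    Assumption: the abstract does not state which test was used.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first arm.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Count the non-event to event changes needed to lose significance.

    Changes are made in the arm with fewer events, one patient at a
    time, until the P-value is no longer below alpha. A result that is
    not significant to begin with gets an index of 0.
    """
    if events_1 > events_2:
        # Always modify the arm with the smaller number of events.
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    changes = 0
    while (
        events_1 < n_1
        and fisher_exact_p(events_1, n_1 - events_1,
                           events_2, n_2 - events_2) < alpha
    ):
        events_1 += 1
        changes += 1
    return changes
```

For a hypothetical trial with 10 events among 100 patients in one arm and 30 among 100 in the other, `fragility_index(10, 100, 30, 100)` returns the number of patients in the first arm whose status would have to change before the comparison stops being significant at 0.05. That number can then be compared with the number of patients lost to follow-up, as the RESULTS do.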
