What is the question?

Mistaking the type of question being considered is the most common error in data analysis.

Over the past 2 years, increased focus on statistical analysis brought on by the era of big data has pushed the issue of reproducibility out of the pages of academic journals and into the popular consciousness (1). Just weeks ago, a paper about the relationship between tissue-specific cancer incidence and stem cell divisions (2) was widely misreported because of misunderstandings about the primary statistical argument in the paper (3). Public pressure has contributed to the massive recent adoption of reproducible research tools, with corresponding improvements in reproducibility. But an analysis can be fully reproducible and still be wrong. Even the most spectacularly irreproducible analyses, like those underlying the ongoing lawsuits (4) over failed genomic signatures for chemotherapy assignment (5), are ultimately reproducible (6). Once an analysis is reproducible, the key question we want to answer is, “Is this data analysis correct?” We have found that the most frequent failure in data analysis is mistaking the type of question being considered.

[1] B. Vogelstein et al. Variation in cancer risk among tissues can be explained by the number of stem cell divisions, 2015, Science.

[2] Neil E. Caporaso et al. Abstract 2157: Causal effects of delaying smoking initiation on subsequent lung cancer risk, 2014.

[3] H. Schneider et al. Procalcitonin for the clinical laboratory: a review, 2007, Pathology.

[4] Andrew W. Correia et al. Effect of Air Pollution Control on Life Expectancy in the United States: An Analysis of 545 U.S. Counties for the Period from 2000 to 2007, 2013, Epidemiology.

[5] H. Dressman et al. Genomic signatures to guide the use of chemotherapeutics, 2006, Nature Medicine.

[6] Jeffrey T. Leek et al. An estimate of the science-wise false discovery rate and application to the top medical literature, 2014, Biostatistics.

[7] Acute respiratory infections: the forgotten pandemic, 1998, Bulletin of the World Health Organization.

[8] Samir M. Fakhry et al. The utility of procalcitonin in critically ill trauma patients, 2012, The Journal of Trauma and Acute Care Surgery.

[9] Andrew Gelman et al. Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values, 2014, Biostatistics.

[10] D. Lazer et al. The Parable of Google Flu: Traps in Big Data Analysis, 2014, Science.

[11] K. Coombes et al. Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, 2009, 1010.1092.