A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results

Inference is the process of using facts we know to learn about facts we do not know. A theory of inference gives assumptions necessary to get from the former to the latter, along with a definition for and summary of the resulting uncertainty. Any one theory of inference is neither right nor wrong, but merely an axiom that may or may not be useful. Each of the many diverse theories of inference can be valuable for certain applications. However, no existing theory of inference addresses the tendency to choose, from the range of plausible data analysis specifications consistent with prior evidence, those that inadvertently favor one's own hypotheses. Since the biases from these choices are a growing concern across scientific fields, and in a sense the reason the scientific community was invented in the first place, we introduce a new theory of inference designed to address this critical problem. We derive "hacking intervals," which are the range of a summary statistic one may obtain given a class of possible endogenous manipulations of the data. Hacking intervals require no appeal to hypothetical data sets drawn from imaginary superpopulations. A scientific result with a small hacking interval is more robust to researcher manipulation than one with a larger interval, and is often easier to interpret than a classical confidence interval. Some versions of hacking intervals turn out to be equivalent to classical confidence intervals, which means they may also provide a more intuitive and potentially more useful interpretation of classical confidence intervals.
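To make the idea of a hacking interval concrete, the following is a minimal sketch and not the authors' procedure. It assumes one simple class of manipulations, namely the analyst's choice of which control variables to include in an ordinary least-squares regression, and reports the smallest and largest treatment coefficient obtainable across that class. The function names `ols_coef` and `hacking_interval`, the simulated data, and the choice of manipulation class are all illustrative assumptions.

```python
import itertools
import numpy as np

def ols_coef(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def hacking_interval(treatment, controls, y):
    """Min and max of the treatment coefficient over every subset of control columns.

    Assumed manipulation class (hypothetical): the analyst may include or
    exclude any of the available controls. Exhaustive search, so only
    practical for a small number of controls.
    """
    n, p = controls.shape
    estimates = []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            X = np.column_stack([np.ones(n), treatment, controls[:, list(subset)]])
            estimates.append(ols_coef(X, y)[1])  # index 1 is the treatment column
    return min(estimates), max(estimates)

# Simulated data (purely illustrative): control 0 confounds the treatment effect.
rng = np.random.default_rng(0)
n = 200
controls = rng.normal(size=(n, 3))
treatment = 0.5 * controls[:, 0] + rng.normal(size=n)
y = 1.0 * treatment + 2.0 * controls[:, 0] + rng.normal(size=n)

lo, hi = hacking_interval(treatment, controls, y)
print(f"hacking interval for the treatment coefficient: [{lo:.3f}, {hi:.3f}]")
```

A narrow interval would suggest the estimate changes little no matter which controls are chosen; a wide one suggests the reported result depends heavily on the specification. No sampling distribution or hypothetical repeated data set is involved: the interval is computed entirely from the observed data and the stated class of analysis choices.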
