The power of noise and the art of prediction

Data analysis usually aims to identify a particular signal, such as an intervention effect. Conventional analyses often assume a specific data-generating process, and that assumption implies a single theoretical model expected to fit the data best. Machine learning techniques make no such assumption; instead, they encourage multiple models to compete on the same data. Applying logistic regression and machine learning algorithms to real and simulated datasets that differ in their noise and signal characteristics, we demonstrate that no single model dominates the others under all circumstances. By showing when different models shine or struggle, we argue that predictive analyses should be conducted with cross-validation to provide better evidence for decision making.
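The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis: the simulated signals (`linear` and `interaction`), the 10% label noise, the sample size, and the use of k-nearest neighbours as a stand-in for "machine learning algorithms" are all assumptions chosen for illustration. The sketch fits a logistic regression and a k-NN classifier to each simulated dataset and scores both with 5-fold cross-validation.

```python
import math
import random

def simulate(n, signal, noise, rng):
    """Simulate binary outcomes from two predictors, then flip a share of labels (noise)."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        # Hypothetical signals: additive ("linear") or a pure interaction.
        y = int(x1 + x2 > 0) if signal == "linear" else int(x1 * x2 > 0)
        if rng.random() < noise:
            y = 1 - y
        rows.append(((x1, x2), y))
    return rows

def fit_logistic(train, epochs=200, lr=0.5):
    """Logistic regression fitted by batch gradient descent; returns a predict function."""
    w = [0.0, 0.0, 0.0]  # intercept, slope for x1, slope for x2
    n = len(train)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in train:
            z = w[0] + w[1] * x1 + w[2] * x2
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            err = p - y
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return lambda x: int(w[0] + w[1] * x[0] + w[2] * x[1] > 0)

def fit_knn(train, k=7):
    """k-nearest neighbours with majority vote; returns a predict function."""
    def predict(x):
        nearest = sorted(
            train,
            key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
        )[:k]
        return int(2 * sum(y for _, y in nearest) > k)
    return predict

def cv_accuracy(fit, data, folds=5):
    """Mean held-out accuracy over interleaved folds (rows are already in random order)."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        model = fit(train)
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(scores) / folds

rng = random.Random(0)
results = {}
for signal in ("linear", "interaction"):
    data = simulate(300, signal, 0.1, rng)
    results[signal] = {
        "logistic": cv_accuracy(fit_logistic, data),
        "knn": cv_accuracy(fit_knn, data),
    }
    print(signal, results[signal])
```

Under these assumptions, logistic regression should recover the additive signal well, while it has no way to represent the pure interaction and k-NN should clearly outperform it there. Which model looks best therefore depends on the structure of the signal, and held-out accuracy is what reveals the difference.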
