Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
