Loss-Based Estimation with Evolutionary Algorithms and Cross-Validation

Statistical estimation in multivariate data sets presents myriad challenges when the form of the regression function linking the outcome and explanatory variables is unknown. Our study examines the computational challenges of the optimization problem underlying regression estimation and designs intelligent procedures for this setting. We begin by analyzing the size of the parameter space in polynomial regression as a function of the number of variables and the constraints on the polynomial degree and on the number of interacting explanatory variables. We then propose a new estimation procedure that relies on cross-validation to select the optimal parameter subspace and on an evolutionary algorithm to minimize risk within that subspace based on the available data. This general-purpose procedure can be shown to perform well in a variety of challenging multivariate estimation settings. It is also flexible enough to let the user incorporate known causal structures into the estimate and to adjust computational parameters, such as the population mutation rate, to the specific challenges of the problem. Moreover, the procedure can be shown to converge asymptotically to the globally optimal estimate. We compare this evolutionary algorithm to a variety of competitors in simulation studies and in the context of a study of disease progression in diabetes patients.
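The two-stage structure described above — cross-validation to choose the parameter subspace (here, the maximum polynomial degree), followed by an evolutionary search for the risk-minimizing coefficients within that subspace — can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy quadratic data, and the tuning parameters (population size, mutation rate, number of generations) are all illustrative assumptions.

```python
import random

random.seed(0)

# Toy data from a quadratic truth, y = 2*x^2 (illustrative assumption).
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 2 * x * x) for x in xs]

def predict(coeffs, x):
    # coeffs[k] multiplies x**k
    return sum(c * x ** k for k, c in enumerate(coeffs))

def risk(coeffs, pts):
    # Empirical risk: mean squared error over the sample
    return sum((predict(coeffs, x) - y) ** 2 for x, y in pts) / len(pts)

def evolve(degree, pts, pop_size=20, generations=100, mut_rate=0.5):
    # Simple (mu + lambda)-style search over coefficient vectors:
    # keep the better half, perturb each parent with Gaussian mutation.
    pop = [[random.uniform(-3, 3) for _ in range(degree + 1)]
           for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=lambda c: risk(c, pts))[: pop_size // 2]
        children = [[c + random.gauss(0, mut_rate) if random.random() < 0.5 else c
                     for c in p] for p in parents]
        pop = parents + children
    return min(pop, key=lambda c: risk(c, pts))

def cv_select(degrees, pts, folds=5):
    # V-fold cross-validation: pick the subspace (max degree)
    # with the smallest average validation risk.
    best_deg, best_cv = degrees[0], float("inf")
    for d in degrees:
        cv = 0.0
        for v in range(folds):
            train = [p for i, p in enumerate(pts) if i % folds != v]
            valid = [p for i, p in enumerate(pts) if i % folds == v]
            cv += risk(evolve(d, train), valid)
        cv /= folds
        if cv < best_cv:
            best_deg, best_cv = d, cv
    return best_deg

deg = cv_select([1, 2, 3], data)      # stage 1: subspace selection
coeffs = evolve(deg, data)            # stage 2: risk minimization in it
```

The split mirrors the procedure's logic: cross-validation guards against choosing too rich a subspace, while the evolutionary step does the actual risk minimization, with the mutation rate exposed as a user-tunable computational parameter.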
