Prequential and Cross-Validated Regression Estimation

Prequential model selection and delete-one cross-validation are data-driven methodologies for choosing between rival models on the basis of their predictive abilities. For a given set of observations, the predictive ability of a model is measured, under prequential model selection, by the model's accumulated prediction error and, under cross-validation, by the model's average out-of-sample prediction error. In this paper, given i.i.d. observations, we propose nonparametric regression estimators based on neural networks that select the number of “hidden units” (or “neurons”) using either prequential model selection or delete-one cross-validation. As our main contributions, (i) we establish rates of convergence for the integrated mean-squared errors in estimating the regression function using “off-line” or “batch” versions of the proposed estimators, and (ii) we establish rates of convergence for the time-averaged expected prediction errors incurred by “on-line” versions of the proposed estimators. We also present computer simulations that (i) empirically validate the proposed estimators and (ii) empirically compare them with certain novel prequential and cross-validated “mixture” regression estimators.
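To make the two selection criteria concrete, the sketch below (a minimal illustration, not the paper's estimator) chooses the number of hidden units by minimizing either the accumulated prequential prediction error or the delete-one cross-validation error over a small candidate grid. The use of scikit-learn's MLPRegressor as a stand-in single-hidden-layer network, the synthetic data, the warm-up length, and the candidate grid are all illustrative assumptions.

```python
# Minimal sketch: prequential vs. delete-one cross-validated selection of the
# number of hidden units in a one-hidden-layer network (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 120
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

def fit_predict(k, X_tr, y_tr, X_te):
    """Fit a network with k hidden units on (X_tr, y_tr) and predict at X_te."""
    net = MLPRegressor(hidden_layer_sizes=(k,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    return net.predict(X_te)

def prequential_error(k, warmup=20):
    """Accumulated squared prediction error: each point is predicted from its past."""
    err = 0.0
    for t in range(warmup, n):
        pred = fit_predict(k, X[:t], y[:t], X[t:t + 1])
        err += (y[t] - pred[0]) ** 2
    return err

def loo_cv_error(k):
    """Delete-one cross-validation: average out-of-sample squared prediction error."""
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        pred = fit_predict(k, X[mask], y[mask], X[i:i + 1])
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

candidates = [1, 2, 4, 8]          # candidate numbers of hidden units
k_preq = min(candidates, key=prequential_error)
k_cv = min(candidates, key=loo_cv_error)
print("prequential choice:", k_preq, "| delete-one CV choice:", k_cv)
```

The prequential criterion only ever predicts “forward,” so it mimics the on-line setting, whereas delete-one cross-validation refits on all but one observation and so corresponds to the off-line setting.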
