Prequential and Cross-Validated Regression Estimation

Prequential model selection and delete-one cross-validation are data-driven methodologies for choosing between rival models on the basis of their predictive abilities. For a given set of observations, the predictive ability of a model is measured, under prequential model selection, by the model's accumulated prediction error and, under cross-validation, by the model's average out-of-sample prediction error. In this paper, given i.i.d. observations, we propose nonparametric regression estimators based on neural networks that select the number of “hidden units” (or “neurons”) using either prequential model selection or delete-one cross-validation. As our main contributions, (i) we establish rates of convergence for the integrated mean-squared errors in estimating the regression function using “off-line” or “batch” versions of the proposed estimators, and (ii) we establish rates of convergence for the time-averaged expected prediction errors incurred by “on-line” versions of the proposed estimators. We also present computer simulations that (i) empirically validate the proposed estimators and (ii) empirically compare them with certain novel prequential and cross-validated “mixture” regression estimators.
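To make the two selection criteria concrete, the sketch below (a minimal illustration, not the paper's estimator) chooses the number of hidden units by minimizing either the accumulated prequential prediction error or the delete-one cross-validation error over a small candidate grid. The use of scikit-learn's MLPRegressor as a stand-in single-hidden-layer network, the synthetic data, the warm-up length, and the candidate grid are all illustrative assumptions.

```python
# Minimal sketch: prequential vs. delete-one cross-validated selection of the
# number of hidden units in a one-hidden-layer network (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 120
X = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

def fit_predict(k, X_tr, y_tr, X_te):
    """Fit a network with k hidden units on (X_tr, y_tr) and predict at X_te."""
    net = MLPRegressor(hidden_layer_sizes=(k,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    return net.predict(X_te)

def prequential_error(k, warmup=20):
    """Accumulated squared prediction error: each point is predicted from its past."""
    err = 0.0
    for t in range(warmup, n):
        pred = fit_predict(k, X[:t], y[:t], X[t:t + 1])
        err += (y[t] - pred[0]) ** 2
    return err

def loo_cv_error(k):
    """Delete-one cross-validation: average out-of-sample squared prediction error."""
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        pred = fit_predict(k, X[mask], y[mask], X[i:i + 1])
        errs.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errs))

candidates = [1, 2, 4, 8]          # candidate numbers of hidden units
k_preq = min(candidates, key=prequential_error)
k_cv = min(candidates, key=loo_cv_error)
print("prequential choice:", k_preq, "| delete-one CV choice:", k_cv)
```

The prequential criterion only ever predicts “forward,” so it mimics the on-line setting, whereas delete-one cross-validation refits on all but one observation and so corresponds to the off-line setting.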
