Supervised Learning and Statistical Estimation

Supervised learning is characterized by the presence of a teacher: training inputs are provided together with the desired outputs. The neural network to be trained can therefore be viewed as a parameterized mapping from a known input to an output that should be "as close as possible" to the target output. Different formulations of the distance between the network outputs and the desired outputs lead to different cost functions, i.e. optimization criteria. Nevertheless, the main influence on the learning paradigm comes from assumptions about the nature of the available data. The assumption that the available measurements represent a purely deterministic process leads to the problem of function fitting. The assumption that an underlying random process governs the generation of the training data, on the other hand, leads to statistical estimation of the unknown process parameters. Although there is a significant conceptual difference between these two assumptions, there are many instances where they lead to identical results. In this book we focus our attention on the statistical estimation of the process parameters.
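A standard instance of the agreement between the two views is linear regression under additive i.i.d. Gaussian noise: the negative log-likelihood equals the sum-of-squares cost up to a constant and a positive scale factor, so both criteria rank parameter settings identically and share the same minimizer. The following minimal sketch illustrates this; the toy data, the candidate parameter values, and the assumed noise level `sigma` are all hypothetical.

```python
import math

# Hypothetical toy data: roughly y = 2x + 1 with small fixed perturbations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = 0.5  # assumed (known) noise standard deviation

def sse(w, b):
    """Sum of squared errors: the function-fitting cost."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

def neg_log_likelihood(w, b):
    """Negative log-likelihood under i.i.d. Gaussian noise N(0, sigma^2).
    Equals a constant plus sse / (2 sigma^2), a monotone function of sse."""
    n = len(xs)
    return n / 2 * math.log(2 * math.pi * sigma ** 2) + sse(w, b) / (2 * sigma ** 2)

# Both criteria order candidate parameter settings identically,
# so their minimizers coincide.
candidates = [(1.5, 1.0), (2.0, 1.0), (2.5, 0.5)]
best_sse = min(candidates, key=lambda p: sse(*p))
best_nll = min(candidates, key=lambda p: neg_log_likelihood(*p))
assert best_sse == best_nll
```

The equivalence holds for any fixed `sigma`, since scaling and shifting a cost function by positive constants does not change where its minimum lies; the two formulations diverge only when the noise model is non-Gaussian or the criterion is not a squared error.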
