A theory of generalization in learning machines with neural network applications

This thesis presents a new theory of generalization for learning machines of the neural network type. The new theory can be viewed as a refinement of the decision-theoretic framework of learning based on the uniform weak law of large numbers in probability theory (i.e., the VC-method), and it yields a finer degree of approximation than hitherto available. The role played by VC-theory in the study of learning problems becomes evident in the new framework; indeed, the intrinsic limitation of VC-theory in assessing the generalization error is demonstrated. The focus is on the assessment and improvement of generalization performance when only a finite number of examples is available.

In a unified framework, the theory provides systematic answers to the problems of learnability, assessment of the generalization error, temporal dynamics of generalization, and design of machine complexity. Under conditions weaker than those required for distribution-free (or Probably Approximately Correct) learning, it proves a form of learnability for both fixed and varying machine structures, and it gives rates of growth of machine size sufficient to attain learnability in the latter case. The theory introduces a new method for assessing generalization performance and obtains estimates of the generalization error both after training and during the training process, for general linear and nonlinear machines. These results clarify how the generalization error is related to the number of examples and to machine complexity, and they answer the open problems of when learning should be stopped and how the complexity of the machine affects the generalization error during training, thereby providing a precise language for describing the over-training phenomenon.

The results on generalization error estimation lead to criteria for simultaneously choosing the correct machine size and the optimal stopping time, so that near-optimal generalization performance is attained. These criteria are connected with Akaike's Information Criterion and the Minimum Description Length principle for machine size selection, and they shed light on the properties of the latter. The effects of regularization on the generalization error, as well as the relation between regularization and early stopping, are analyzed; these results in turn provide guidelines for choosing the regularization function. The results of this thesis are relevant to problems of regression, pattern recognition, statistical function estimation, and stochastic approximation.
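
For context, the following is a minimal sketch of the kind of uniform bound that the classical VC-method provides and that the theory above refines; the notation (sample size $n$, function class $\mathcal{F}$, growth function $m_{\mathcal{F}}$, VC dimension $h$) is the standard one and is not taken from the thesis itself:

% Classical Vapnik-Chervonenkis uniform bound, quoted in its standard form for
% orientation only; the constants and notation are assumptions, not results of this thesis.
% For a class \mathcal{F} of indicator functions and an i.i.d. sample X_1, ..., X_n:
\[
  \Pr\!\left\{ \sup_{f \in \mathcal{F}}
    \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \mathbb{E}\, f(X) \right| > \varepsilon
  \right\}
  \;\le\; 4\, m_{\mathcal{F}}(2n)\, e^{-n\varepsilon^{2}/8},
  \qquad
  m_{\mathcal{F}}(n) \;\le\; \left( \frac{en}{h} \right)^{h}
  \quad \text{for } n \ge h .
\]

Because such bounds hold uniformly over $\mathcal{F}$ and for all distributions, they are necessarily worst-case; the abstract's remark on the intrinsic limitation of VC-theory refers to this coarseness.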
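
For reference, the standard forms of the two model-selection criteria mentioned above are given here; these are the well-known general statements, shown only to fix terminology, not the criteria derived in the thesis. For a candidate machine with $k$ adjustable parameters, maximized likelihood $\hat{L}$, and $n$ training examples:

\[
  \mathrm{AIC} \;=\; -2 \log \hat{L} + 2k,
  \qquad
  \mathrm{MDL} \;\approx\; -\log \hat{L} + \frac{k}{2} \log n ,
\]

where, in each case, the machine size minimizing the criterion is selected.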