Discovering Neural Nets with Low Kolmogorov Complexity and High Generalization Capability

Many neural net learning algorithms aim at finding "simple" nets to explain training data. The expectation is: the "simpler" the networks, the better the generalization on test data (→ Occam's razor). Previous implementations, however, use measures for "simplicity" that lack the power, universality and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the "Bayesian" kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and how the Solomonoff-Levin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding "algorithmically simple" problem solutions with high generalization capability. The method is based on Levin complexity (a time-bounded generalization of Kolmogorov complexity) and inspired by Levin's optimal universal search algorithm. For a given problem, solution candidates are computed by efficient "self-sizing" programs that influence their own runtime and storage size. The probabilistic search algorithm finds the "good" programs (the ones quickly computing algorithmically probable solutions fitting the training data). Simulations focus on the task of discovering "algorithmically simple" neural networks with low Kolmogorov complexity and high generalization capability. It is demonstrated that the method, at least with certain toy problems where it is computationally feasible, can lead to generalization results unmatchable by previous neural net algorithms. Much remains to be done, however, to make large-scale applications and "incremental learning" feasible.
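For readers unfamiliar with the quantities named above, the standard textbook formulations (following Li and Vitányi; the exact variants used in the paper may differ in details such as the choice of universal machine) can be summarized as follows:

% Solomonoff-Levin universal prior of a bit string x: the probability that a
% fixed universal prefix machine U outputs x when run on random program bits p,
% where \ell(p) denotes the length of program p.
P_U(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}

% Levin complexity Kt: a time-bounded generalization of Kolmogorov complexity
% that charges a program for its length plus the logarithm of its runtime t(p).
Kt(x) \;=\; \min_{p \,:\, U(p) = x} \bigl\{ \ell(p) + \log t(p) \bigr\}

% Levin's universal search allocates to each candidate program p a fraction of
% the total search time proportional to 2^{-\ell(p)}, so solutions of low Kt
% (short programs that halt quickly) are found first.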
