Discovering Neural Nets with Low Kolmogorov Complexity and High Generalization Capability

Many neural net learning algorithms aim at finding "simple" nets to explain training data. The expectation is: the "simpler" the networks, the better the generalization on test data (→ Occam's razor). Previous implementations, however, use measures for "simplicity" that lack the power, universality and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the "Bayesian" kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and how the Solomonoff-Levin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding "algorithmically simple" problem solutions with high generalization capability. The method is based on Levin complexity (a time-bounded generalization of Kolmogorov complexity) and inspired by Levin's optimal universal search algorithm. For a given problem, solution candidates are computed by efficient "self-sizing" programs that influence their own runtime and storage size. The probabilistic search algorithm finds the "good" programs (the ones quickly computing algorithmically probable solutions fitting the training data). Simulations focus on the task of discovering "algorithmically simple" neural networks with low Kolmogorov complexity and high generalization capability. It is demonstrated that the method, at least with certain toy problems where it is computationally feasible, can lead to generalization results unmatchable by previous neural net algorithms. Much remains to be done, however, to make large-scale applications and "incremental learning" feasible.
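For readers unfamiliar with the quantities named above, the standard textbook formulations (following Li and Vitányi; the exact variants used in the paper may differ in details such as the choice of universal machine) can be summarized as follows:

% Solomonoff-Levin universal prior of a bit string x: the probability that a
% fixed universal prefix machine U outputs x when run on random program bits p,
% where \ell(p) denotes the length of program p.
P_U(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}

% Levin complexity Kt: a time-bounded generalization of Kolmogorov complexity
% that charges a program for its length plus the logarithm of its runtime t(p).
Kt(x) \;=\; \min_{p \,:\, U(p) = x} \bigl\{ \ell(p) + \log t(p) \bigr\}

% Levin's universal search allocates to each candidate program p a fraction of
% the total search time proportional to 2^{-\ell(p)}, so solutions of low Kt
% (short programs that halt quickly) are found first.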
