Discovering Problem Solutions with Low Kolmogorov Complexity and High Generalization Capability

Many machine learning algorithms aim at finding "simple" rules to explain training data. The expectation is: the "simpler" the rules, the better the generalization on test data (Occam's razor). Most practical implementations, however, use measures of "simplicity" that lack the power, universality, and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the "Bayesian" kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and shows how the Solomonoff-Levin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding "algorithmically simple" problem solutions with high generalization capability. The method is based on Levin complexity (a time-bounded generalization of Kolmogorov complexity) and inspired by Levin's optimal universal search algorithm. Given a problem, solution candidates are computed by efficient "self-sizing" programs that influence their own runtime and storage size. The probabilistic search algorithm finds the "good" programs (the ones that quickly compute algorithmically probable solutions fitting the training data). Simulations focus on the task of discovering "algorithmically simple" neural networks with low Kolmogorov complexity and high generalization capability. It is demonstrated that the method, at least on certain toy problems where it is computationally feasible, can lead to generalization results unmatchable by previous neural net algorithms. Much remains to be done, however, to make large-scale applications and "incremental learning" feasible.
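For readers unfamiliar with the underlying quantities: the Solomonoff-Levin distribution assigns a string x the probability P_U(x) = sum over all programs p with U(p) = x of 2^(-l(p)), where U is a universal Turing machine and l(p) is the length of p, and Levin complexity Kt(x) = min_p { l(p) + log2 t(p) } additionally charges each program for its runtime t(p). The following is a minimal sketch of the kind of Levin-style phased search the method is inspired by; the binary program alphabet and the two callbacks run_program and is_solution are illustrative assumptions, not the paper's actual interpreter or its "self-sizing" program encoding.

```python
import itertools

def levin_search(run_program, is_solution, alphabet=("0", "1"), max_phase=30):
    """Sketch of Levin's universal search (assumed interface).

    run_program(p, steps) -> output of program p after at most `steps`
                             interpreter steps, or None if unfinished/invalid.
    is_solution(output)   -> True if the output fits the training data.

    In phase i, every program p with l(p) <= i gets a time budget of
    2 ** (i - l(p)) steps, so short (algorithmically simple) programs are
    tried first and receive the largest budgets.
    """
    for i in range(1, max_phase + 1):
        for length in range(1, i + 1):
            budget = 2 ** (i - length)
            for symbols in itertools.product(alphabet, repeat=length):
                program = "".join(symbols)
                output = run_program(program, budget)
                if output is not None and is_solution(output):
                    return program, output  # first "good" program found
    return None
```

In phase i the total time spent is bounded by roughly 2^i interpreter steps, which makes the bias toward programs that are both short and fast (i.e., of low Levin complexity) explicit.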
