Discovering Solutions with Low Kolmogorov Complexity and High Generalization Capability

Many machine learning algorithms aim at finding “simple” rules to explain training data. The expectation is: the “simpler” the rules, the better the generalization on test data (→ Occam's razor). Most practical implementations, however, use measures of “simplicity” that lack the power, universality and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the “Bayesian” kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and shows how the Solomonoff-Levin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding “algorithmically simple” problem solutions with high generalization capability. The method is based on Levin complexity (a time-bounded extension of Kolmogorov complexity) and inspired by Levin's optimal universal search algorithm. Given a problem, solution candidates are computed by efficient “self-sizing” programs that influence their own runtime and storage size. The probabilistic search algorithm finds the “good” programs (the ones that quickly compute algorithmically probable solutions fitting the training data). Experiments focus on the task of discovering “algorithmically simple” neural networks with low Kolmogorov complexity and high generalization capability. These experiments demonstrate that the method, at least with certain toy problems where it is computationally feasible, can lead to generalization results unmatched by previous neural net algorithms.
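The search procedure the abstract refers to, Levin's universal search, can be sketched in a few lines. The idea: in phase i, every candidate program p of length l(p) ≤ i is run for 2^(i − l(p)) steps, so shorter programs get exponentially larger time budgets, and the total work per phase is bounded by roughly 2^i. The first program whose output fits the training data wins. The toy bitstring interpreter below (`run_program`, which reads a program as a binary integer n and charges n steps to "compute" it) is an illustrative assumption, not part of the paper; the paper's programs compute neural network weight matrices instead.

```python
from itertools import product

def levin_search(run_program, is_solution, max_phase=20):
    """Sketch of Levin's universal search. In phase i, each program p with
    len(p) <= i receives a time budget of 2**(i - len(p)) steps. Returns
    the first (program, output) pair whose output passes is_solution."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)
            for bits in product("01", repeat=length):
                program = "".join(bits)
                output = run_program(program, budget)  # None if budget exhausted
                if output is not None and is_solution(output):
                    return program, output
    return None  # no solution found within max_phase

# Toy interpreter (an assumption for illustration): the program's bits are
# read as a binary integer n, and "computing" n costs n time steps.
def run_program(program, budget):
    n = int(program, 2)
    return n if n <= budget else None

# The shortest/fastest program computing 5 is "101"; it is found in the
# first phase whose budget for length-3 programs reaches 5 steps.
result = levin_search(run_program, lambda out: out == 5)
```

In the paper's setting, `run_program` would be an interpreter for the "self-sizing" programs mentioned above, and `is_solution` would check whether the network computed by the program fits the training data; the bias toward short, fast programs is what favors solutions of low Levin complexity.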
