Applying MDL to learn best model granularity

Abstract. The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends critically on its granularity, for example the precision chosen for its parameters. Too high a precision generally leads to modeling accidental noise, while too low a precision may confuse models that should be distinguished. This precision is often chosen ad hoc. In MDL the best model is the one that most compresses a two-part code of the data set: this embodies “Occam's Razor”. In two quite different experimental settings the value determined theoretically by MDL coincides with the best value found experimentally. In the first experiment the task is to recognize isolated handwritten characters in one subject's handwriting, irrespective of size and orientation. Using a new modification of elastic matching with multiple prototypes per character, the best prediction rate is achieved at the value of the learned parameter (the length of the sampling interval) that MDL considers most likely, which coincides with the best value found experimentally. In the second experiment the task is to model a robot arm with two degrees of freedom using a three-layer feed-forward neural network, where we need to determine the number of nodes in the hidden layer that gives the best modeling performance. The optimal model (the one that extrapolates best to unseen examples) is obtained at the number of hidden nodes that MDL considers most likely, which again coincides with the best value found experimentally.
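To make the two-part coding idea concrete, the following is a minimal sketch of MDL model selection, not the coding scheme used in the paper. Here the "granularity" being chosen is the degree of a fitted polynomial; the model cost is approximated as (k/2)·log2(n) bits for k parameters, and the data-given-model cost is the Gaussian negative log-likelihood of the residuals in bits. The function name, noise model, and constants are illustrative assumptions.

```python
# Hypothetical two-part MDL sketch: choose the model (polynomial degree) that
# minimizes L(model) + L(data | model), measured in bits.
import numpy as np

def two_part_code_length(x, y, degree):
    n = len(y)
    k = degree + 1                                  # number of coefficients
    coeffs = np.polyfit(x, y, degree)               # fit the candidate model
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(np.mean(residuals ** 2), 1e-12)    # guard against log(0)
    # L(data | model): Gaussian negative log-likelihood of the residuals, in bits
    data_bits = 0.5 * n * (np.log2(2 * np.pi * sigma2) + 1 / np.log(2))
    # L(model): roughly (k/2) * log2(n) bits to encode k parameters
    model_bits = 0.5 * k * np.log2(n)
    return model_bits + data_bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 100)
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0, 0.1, size=x.size)
    lengths = {d: two_part_code_length(x, y, d) for d in range(1, 10)}
    print("MDL-selected degree:", min(lengths, key=lengths.get))
```

Under-fitting (too low a degree) inflates the data term because residual noise must be encoded literally, while over-fitting (too high a degree) inflates the model term; the minimum of the sum plays the role of the MDL-preferred granularity, analogous to the sampling-interval length and hidden-layer size in the two experiments.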
