A tutorial introduction to the minimum description length principle

This tutorial provides an overview of, and introduction to, Rissanen's Minimum Description Length (MDL) Principle. The first chapter gives a conceptual, entirely non-technical introduction to the subject. It serves as the basis for the technical introduction in the second chapter, in which all the ideas of the first chapter are made mathematically precise. The main ideas are discussed in great conceptual and technical detail. This tutorial is an extended version of the first two chapters of the collection "Advances in Minimum Description Length: Theory and Applications" (edited by P. Grünwald, I. J. Myung and M. Pitt, to be published by the MIT Press, Spring 2005).
