Minimum Description Length Revisited

This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning, and pattern recognition. Although MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection, model averaging, and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and methods classically viewed as highly distinct, such as AIC versus BIC and cross-validation versus Bayes, can to a large extent be viewed from a unified perspective.
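
To make the last point concrete, the following is a minimal sketch of the central MDL quantity and its luckiness generalization, written in standard notation assumed for this illustration (the symbols $p_\theta$, $a$, and $\hat{\theta}_a$ are not defined in the abstract itself). For a model class $\{p_\theta : \theta \in \Theta\}$ and data $x^n$, the normalized maximum likelihood (NML) distribution and its luckiness variant take the form

\[
% Sketch only: notation assumed for illustration, not taken from the abstract.
\bar{p}_{\mathrm{NML}}(x^n) \;=\; \frac{p_{\hat{\theta}(x^n)}(x^n)}{\int p_{\hat{\theta}(y^n)}(y^n)\,\mathrm{d}y^n},
\qquad
\bar{p}_{\mathrm{LNML}}(x^n) \;=\; \frac{p_{\hat{\theta}_a(x^n)}(x^n)\,a\!\left(\hat{\theta}_a(x^n)\right)}{\int p_{\hat{\theta}_a(y^n)}(y^n)\,a\!\left(\hat{\theta}_a(y^n)\right)\mathrm{d}y^n},
\]

where $\hat{\theta}(x^n)$ is the maximum likelihood estimator, $a$ is a luckiness function, and $\hat{\theta}_a(x^n) = \arg\max_\theta p_\theta(x^n)\,a(\theta)$. Taking $a \equiv 1$ recovers plain NML, whose code length $-\log \bar{p}_{\mathrm{NML}}(x^n)$ attains the worst-case (rather than average-case) minimax regret; more generally, $-\log a(\theta)$ plays the role that a penalization term or log-prior plays in penalized likelihood and Bayesian methods.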
