Model Selection and the Principle of Minimum Description Length

This article reviews the principle of minimum description length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed attention within the statistics community. Here we review both the practical and the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines we find many interesting interpretations of popular frequentist and Bayesian procedures. As we show, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate the MDL principle by considering problems in regression, nonparametric curve estimation, cluster analysis, and time series analysis. Because model selection in linear regression is an extremely common problem that arises in many applications, we present detailed derivations of several MDL criteria in this context and discuss their properties through a number of examples. Our emphasis is on the practical application of MDL, and hence we make extensive use of real datasets. In writing this review, we tried to make the descriptive philosophy of MDL natural to a statistics audience by examining classical problems in model selection. In the engineering literature, however, MDL is being applied to ever more exotic modeling situations. As a principle for statistical modeling in general, one strength of MDL is that it can be intuitively extended to provide useful tools for new problems.
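As a rough illustration of how a description-length criterion can discriminate between competing regression models, the sketch below scores nested polynomial fits with a two-part code length of the form (n/2) log(RSS/n) + (k/2) log n, measured in nats with constant terms dropped. This particular form, which coincides with the BIC penalty, is an assumption chosen for illustration and is not one of the derivations from the article itself. The function names `two_part_mdl` and `select_polynomial_degree` and the simulated data are likewise hypothetical.

```python
import numpy as np


def two_part_mdl(y, X):
    """Approximate two-part description length (in nats) of y given design X.

    Assumed form: (n/2) * log(RSS/n) for the data given the fitted model,
    plus (k/2) * log(n) for the k estimated coefficients. Constants that are
    identical across models are dropped.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    rss = float(np.sum((y - X @ beta) ** 2))      # residual sum of squares
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)


def select_polynomial_degree(x, y, max_degree=6):
    """Score polynomial models of degree 0..max_degree; return the minimizer."""
    scores = {}
    for degree in range(max_degree + 1):
        # Columns 1, x, x^2, ..., x^degree
        X = np.vander(x, degree + 1, increasing=True)
        scores[degree] = two_part_mdl(y, X)
    best = min(scores, key=scores.get)
    return best, scores


# Simulated data (hypothetical): a quadratic signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

best_degree, scores = select_polynomial_degree(x, y)
for degree, score in scores.items():
    print(f"degree {degree}: description length {score:.2f} nats")
print("selected degree:", best_degree)
```

Models that are too simple pay through a large residual term, while each extra coefficient costs (1/2) log n nats, so the criterion favors the shortest overall description rather than the best raw fit.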
