Schwarz, Wallace, and Rissanen: Intertwining Themes in Theories of Model Selection

Investigators interested in model order estimation have tended to divide themselves into widely separated camps. This survey of the contributions of Schwarz, Wallace, Rissanen, and their coworkers attempts to build bridges between the various viewpoints, illuminating connections that may have previously gone unnoticed and clarifying misconceptions that seem to have propagated in the applied literature. Our tour begins with Schwarz's approximation of Bayesian integrals via Laplace's method. We then introduce the concepts underlying Rissanen's minimum description length (MDL) principle via a Bayesian scenario with a known prior; this provides the groundwork for understanding his more complex non‐Bayesian MDL, which employs a “universal” encoding of the integers. Rissanen's method of parameter truncation is contrasted with that employed in various versions of Wallace's minimum message length criteria. Rissanen's more recent notion of stochastic complexity is outlined in terms of Bernardo's information‐theoretic derivation of the Jeffreys prior.
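
The “universal” encoding of the integers mentioned above can be made concrete with a small sketch. It assumes the commonly quoted form of Rissanen's code length, L*(n) = log2(c) + log2(n) + log2(log2(n)) + ..., where the sum keeps only the positive terms and c is approximately 2.865064. The function names `log_star` and `universal_code_length` are illustrative choices, not notation taken from the survey.

```python
import math

# Approximate normalizing constant for the universal prior on the integers
# (assumed value, as commonly quoted for Rissanen's log* code).
C0 = 2.865064


def log_star(n: int) -> float:
    """Sum of the positive iterated base-2 logarithms of n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = 0.0
    term = math.log2(n)
    # Keep adding log2(n), log2(log2(n)), ... while the terms stay positive.
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def universal_code_length(n: int) -> float:
    """Approximate code length, in bits, for encoding the positive integer n."""
    return math.log2(C0) + log_star(n)


if __name__ == "__main__":
    for n in (1, 2, 16, 1000):
        print(n, round(universal_code_length(n), 3))
```

The code length grows only slightly faster than log2(n), which is why such a code can stand in for a prior on an integer-valued quantity (for example, a truncated parameter) when no genuine prior is available.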
