Computationally Efficient Methods for MDL-Optimal Density Estimation and Data Clustering

The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantiation is based on the so-called Normalized Maximum Likelihood (NML) distribution, which has been shown to possess several important theoretical properties. However, applications of this modern version of MDL have been rare because of computational problems: for discrete data, the definition of NML involves a sum over an exponential number of terms, and for continuous data, a multi-dimensional integral that is usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
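To give a concrete flavour of the computations involved, the sketch below evaluates the NML normalizing sum C_K(n) for a single multinomial variable with K categories and sample size n, using the published linear-time recurrence C_{k+2}(n) = C_{k+1}(n) + (n/k) C_k(n) for the multinomial stochastic complexity. This is an illustrative plain-Python rendering of that known recurrence under our own naming, not the dissertation's code.

import math

def multinomial_nml_normalizer(K, n):
    """Normalizing sum C_K(n) of the NML distribution for a multinomial
    model with K categories and sample size n, computed with the
    linear-time recurrence C_{k+2}(n) = C_{k+1}(n) + (n/k) * C_k(n)."""
    if K < 1 or n < 1:
        raise ValueError("K and n must be positive integers")
    # C_1(n) = 1: with a single category the maximized likelihood is always 1.
    c_prev = 1.0
    if K == 1:
        return c_prev
    # C_2(n) = sum_h binom(n, h) (h/n)^h ((n-h)/n)^(n-h), with 0^0 taken as 1.
    c_curr = sum(
        math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    # Lift the category count from 2 up to K one step at a time.
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr

# Given the normalizer, the stochastic complexity of an observed sample x^n
# under this model is -log P(x^n | theta_hat(x^n)) + log C_K(n).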
