An information theoretic analysis of maximum likelihood mixture estimation for exponential families

An important task in unsupervised learning is maximum likelihood mixture estimation (MLME) for exponential families. In this paper, we prove a mathematical equivalence between this MLME problem and the rate distortion problem for Bregman divergences. We also present new theoretical results in rate distortion theory for Bregman divergences. Further, we analyze both problems as a trade-off between compression and preservation of information, and show that this analysis yields the information bottleneck method as an interesting special case.
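To make the equivalence concrete, EM for an exponential-family mixture can be written as soft clustering under the corresponding Bregman divergence. Below is a minimal sketch in Python, assuming the squared Euclidean distance, the Bregman divergence associated with a spherical Gaussian mixture; the function name, interface, and constants are illustrative and not taken from the paper.

```python
import numpy as np

def bregman_soft_clustering(X, k, n_iters=50, seed=0):
    """EM for an exponential-family mixture, phrased as soft clustering
    under a Bregman divergence. This sketch uses squared Euclidean
    distance (the spherical-Gaussian case); it is illustrative only."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]  # initial means
    pi = np.full(k, 1.0 / k)                      # initial mixing weights
    for _ in range(n_iters):
        # E-step: p(h | x) is proportional to pi_h * exp(-d_phi(x, mu_h)).
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_post = np.log(pi)[None, :] - dist
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: the optimal mean is the posterior-weighted average of the
        # data, a property that holds for every Bregman divergence.
        weights = post.sum(axis=0)
        mu = (post.T @ X) / weights[:, None]
        pi = weights / n
    return mu, pi, post

# Example: recover two well-separated clusters.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
means, priors, posteriors = bregman_soft_clustering(X, k=2)
```

The M-step's posterior-weighted mean is the same update regardless of which Bregman divergence is used; only the E-step's divergence computation changes with the choice of exponential family.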
