Clustering with Bregman Divergences

A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical k-means, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical k-means algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, which we call regular Bregman divergences. This result enables an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information-theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.
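To make the k-means-style alternation concrete, here is a minimal sketch of Bregman hard clustering in Python/NumPy, assuming a user-supplied divergence function; the names (bregman_hard_cluster, squared_euclidean, kl_divergence) are illustrative and not from the paper. The key property it relies on is that, for any Bregman divergence, the arithmetic mean of a cluster is the optimal centroid, so only the assignment step depends on the chosen divergence.

```python
# Minimal sketch of Bregman hard clustering (a k-means-style alternation).
# Assumptions: NumPy only; divergence names and signatures are illustrative.
import numpy as np

def squared_euclidean(x, mu):
    # Bregman divergence generated by phi(x) = ||x||^2 (classical k-means case).
    return np.sum((x - mu) ** 2, axis=-1)

def kl_divergence(x, mu, eps=1e-12):
    # Generalized I-divergence, the Bregman divergence generated by negative entropy.
    return np.sum(x * np.log((x + eps) / (mu + eps)) - x + mu, axis=-1)

def bregman_hard_cluster(X, k, divergence=squared_euclidean, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster whose centroid
        # minimizes the chosen Bregman divergence.
        dists = np.stack([divergence(X, mu) for mu in centroids], axis=1)
        labels = dists.argmin(axis=1)
        # Re-estimation step: the arithmetic mean is optimal for every
        # Bregman divergence, which keeps the update identical to k-means.
        new_centroids = np.array([
            X[labels == h].mean(axis=0) if np.any(labels == h) else centroids[h]
            for h in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

With squared_euclidean this reduces to classical k-means; swapping in kl_divergence on rows that are nonnegative (e.g., normalized count vectors) gives an information-theoretic clustering in the spirit of the special cases unified by the paper.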
