Deterministic annealing for clustering, compression, classification, regression, and related optimization problems

The deterministic annealing approach to clustering and its extensions have demonstrated substantial performance improvements over standard supervised and unsupervised learning methods in a variety of important applications, including compression, estimation, pattern recognition and classification, and statistical regression. The method minimizes the application-specific cost subject to a constraint on the randomness of the solution, and the permitted level of randomness is gradually lowered. We emphasize the intuition gained from the analogy to statistical physics. Alternatively, the method can be derived within rate-distortion theory, where the annealing process is equivalent to computing Shannon's rate-distortion function and the annealing temperature is inversely proportional to the slope of the rate-distortion curve. The basic algorithm is extended by incorporating structural constraints, which allows optimization of numerous popular structures including vector quantizers, decision trees, multilayer perceptrons, radial basis functions, and mixtures of experts.
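As a rough illustration of the annealing loop described in the abstract, the sketch below clusters one-dimensional data. It is a minimal example under stated assumptions, not the paper's implementation: it assumes a squared-error cost, Gibbs (softmax) association probabilities at temperature T, a geometric cooling schedule, and a small random perturbation of the codevectors at each temperature so that coincident codevectors can separate. The function name `da_cluster` and all of its parameters are hypothetical.

```python
import math
import random


def da_cluster(points, n_codes=2, t_start=100.0, t_min=1e-3,
               cooling=0.8, inner_iters=30, jitter=1e-3, seed=0):
    """Sketch of deterministic-annealing clustering for 1-D data.

    Assumptions (not taken from the source): squared-error distortion,
    geometric cooling, and a fixed number of inner iterations per temperature.
    """
    rng = random.Random(seed)
    mean = sum(points) / len(points)
    # At high temperature every codevector sits at the data mean.
    codes = [mean] * n_codes
    t = t_start
    while t > t_min:
        # Perturb slightly so coincident codevectors can split once the
        # temperature is low enough for a split to lower the cost.
        codes = [y + jitter * rng.uniform(-1.0, 1.0) for y in codes]
        for _ in range(inner_iters):
            weighted_sums = [0.0] * n_codes
            weights = [0.0] * n_codes
            for x in points:
                dists = [(x - y) ** 2 for y in codes]
                d_min = min(dists)
                # Gibbs association probabilities; subtracting d_min keeps
                # every exponent <= 0 and avoids overflow.
                gibbs = [math.exp(-(d - d_min) / t) for d in dists]
                z = sum(gibbs)
                for j in range(n_codes):
                    p = gibbs[j] / z
                    weights[j] += p
                    weighted_sums[j] += p * x
            # Each codevector moves to the probability-weighted centroid.
            codes = [weighted_sums[j] / weights[j] if weights[j] > 0 else codes[j]
                     for j in range(n_codes)]
        t *= cooling  # lower the temperature, i.e. the allowed randomness
    return sorted(codes)


if __name__ == "__main__":
    data = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]
    print(da_cluster(data, n_codes=2))
```

At high temperature the associations are nearly uniform and both codevectors stay near the data mean. As the temperature drops the associations harden, and the codevectors separate and settle near the two group centroids. The cooling factor and iteration counts here are arbitrary choices for the toy data and would need tuning in practice.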
