Supervised Learning of Quantizer Codebooks by Information Loss Minimization

This paper proposes a technique for jointly quantizing continuous features and the posterior distributions of their class labels based on minimizing empirical information loss such that the quantizer index of a given feature vector approximates a sufficient statistic for its class label. Informally, the quantized representation retains as much information as possible for classifying the feature vector correctly. We derive an alternating minimization procedure for simultaneously learning codebooks in the Euclidean feature space and in the simplex of posterior class distributions. The resulting quantizer can be used to encode unlabeled points outside the training set and to predict their posterior class distributions, and has an elegant interpretation in terms of lossless source coding. The proposed method is validated on synthetic and real data sets and is applied to two diverse problems: learning discriminative visual vocabularies for bag-of-features image classification and image segmentation.

[1]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[2]  S. M. Ali,et al.  A General Class of Coefficients of Divergence of One Distribution from Another , 1966 .

[3]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[4]  Solomon Kullback,et al.  Information Theory and Statistics , 1970, The Mathematical Gazette.

[5]  Toby Berger,et al.  Rate distortion theory : a mathematical basis for data compression , 1971 .

[6]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[7]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[8]  H. Poor,et al.  Fine quantization in signal detection and estimation , 1988, IEEE Trans. Inf. Theory.

[9]  Teuvo Kohonen,et al.  Improved versions of learning vector quantization , 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[10]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[11]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[12]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[13]  Christian P. Robert,et al.  The Bayesian choice , 1994 .

[14]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[15]  R. Gray,et al.  Combining Image Compression and Classification Using Vector Quantization , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  László Györfi,et al.  A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[17]  Kenneth Rose,et al.  A generalized VQ method for combined compression and estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[18]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[19]  H. Damasio,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence: Special Issue on Perceptual Organization in Computer Vision , 1998 .

[20]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[21]  David Mumford,et al.  Statistics of natural images and models , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[22]  Flemming Topsøe,et al.  Some inequalities for information divergence and related measures of discrimination , 2000, IEEE Trans. Inf. Theory.

[23]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.

[24]  T. Linder LEARNING-THEORETIC METHODS IN VECTOR QUANTIZATION , 2002 .

[25]  장윤희,et al.  Y. , 2003, Industrial and Labor Relations Terms.

[26]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[27]  Jitendra Malik,et al.  Learning a classification model for segmentation , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[28]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[29]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[30]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[31]  Michael J. Swain,et al.  Color indexing , 1991, International Journal of Computer Vision.

[32]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[33]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[34]  Hans C. van Houwelingen,et al.  The Elements of Statistical Learning, Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, New York, 2001. No. of pages: xvi+533. ISBN 0‐387‐95284‐5 , 2004 .

[35]  W. Bialek,et al.  Information-based clustering. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Christoph Schnörr,et al.  Natural Image Statistics for Natural Image Segmentation , 2005, International Journal of Computer Vision.

[37]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[38]  Robert M. Gray,et al.  Lloyd clustering of Gauss mixture models for image compression and classification , 2005, Signal Process. Image Commun..

[39]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[40]  Francesca Odone,et al.  Building kernels from binary strings for image matching , 2005, IEEE Transactions on Image Processing.

[41]  Naftali Tishby,et al.  The Power of Word Clusters for Text Classification , 2006 .

[42]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[43]  Frédéric Jurie,et al.  Latent mixture vocabularies for object categorization and segmentation , 2006, Image Vis. Comput..

[44]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[45]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[46]  Svetlana Lazebnik,et al.  Learning Nearest-Neighbor Quantizers from Labeled Data by Information Loss Minimization , 2007, AISTATS.

[47]  P. Grünwald The Minimum Description Length Principle (Adaptive Computation and Machine Learning) , 2007 .

[48]  Jorma Rissanen,et al.  Minimum Description Length Principle , 2010, Encyclopedia of Machine Learning.

[49]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[50]  Evgueni A. Haroutunian,et al.  Information Theory and Statistics , 2011, International Encyclopedia of Statistical Science.