Classification via Minimum Incremental Coding Length

We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that requires the minimum number of additional bits to code the test sample, subject to an allowable distortion. We demonstrate the asymptotic optimality of this criterion for Gaussian distributions and analyze its relationships to classical classifiers. The theoretical results clarify the connections between our approach and popular classifiers such as maximum a posteriori (MAP), regularized discriminant analysis (RDA), $k$-nearest neighbor ($k$-NN), and support vector machine (SVM), as well as unsupervised methods based on lossy coding. Our formulation has several desirable effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect that stabilizes the (implicit) density estimate in small-sample settings. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. All MATLAB code and classification results are publicly available for peer evaluation at http://perception.csl.uiuc.edu/coding/home.htm.
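To make the criterion concrete, the following is a minimal sketch in Python/NumPy. It assumes the Gaussian lossy coding length approximation L_eps(X) = ((m + n) / 2) * log2 det(I + (n / (m * eps^2)) * X * X^T) for m samples stacked as the columns of X in R^{n x m}, a zero-mean simplification, and a class-label cost of -log2 of the class prior. The function names coding_length and micl_classify, the parameter eps, and the dictionary-based interface are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def coding_length(X, eps):
    """Approximate bits needed to code the columns of X (n x m)
    up to mean squared distortion eps^2 (zero-mean Gaussian model)."""
    n, m = X.shape
    # log2 det(I + n/(m eps^2) X X^T), via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (m * eps ** 2)) * (X @ X.T))
    return 0.5 * (m + n) * logdet / np.log(2)

def micl_classify(x, class_data, eps=1.0):
    """Assign x to the class whose training data needs the fewest
    additional bits to also encode x, plus the cost of coding the label."""
    total = sum(Xj.shape[1] for Xj in class_data.values())
    best_label, best_bits = None, np.inf
    for label, Xj in class_data.items():
        delta = coding_length(np.column_stack([Xj, x]), eps) - coding_length(Xj, eps)
        delta += -np.log2(Xj.shape[1] / total)  # bits to code the class label
        if delta < best_bits:
            best_label, best_bits = label, delta
    return best_label
```

In this sketch the distortion parameter eps plays the role described in the abstract: it acts as a regularizer on the implicit covariance estimate, so the incremental coding length remains well defined even when a class has fewer samples than ambient dimensions.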
