In defense of soft-assignment coding

In object recognition, soft-assignment coding is computationally efficient and conceptually simple. However, its classification performance is inferior to that of the more recently developed sparse and local coding schemes. It would be highly desirable to bring its classification performance up to the state of the art, yielding a coding scheme that combines computational efficiency with classification accuracy. To achieve this, we revisit soft-assignment coding from two key aspects: classification performance and probabilistic interpretation. For the first aspect, we argue that the inferiority of soft-assignment coding is due to its neglect of the underlying manifold structure of local features. To remedy this, we propose a simple modification that localizes soft-assignment coding, which surprisingly achieves comparable or even better performance than existing sparse or local coding schemes while maintaining its computational advantage. For the second aspect, based on our probabilistic interpretation of soft-assignment coding, we give a probabilistic explanation of the max-pooling operation, which has been used successfully by sparse and local coding schemes but is still poorly understood. This probabilistic explanation motivates us to develop a new mix-order max-pooling operation, which further improves the classification performance of the proposed coding scheme. As demonstrated experimentally, the localized soft-assignment coding achieves state-of-the-art classification performance with the highest computational efficiency among the existing coding schemes.
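To make the idea of localizing soft-assignment coding concrete, the following is a minimal sketch, not the authors' implementation. The abstract gives no formula, so the sketch assumes the commonly used form of soft assignment (weights proportional to exp(-beta * squared distance), normalized to sum to one) and assumes that "localizing" means restricting those weights to the k nearest visual words. The function names and the parameters `beta` and `k` are hypothetical. Plain max-pooling is shown; the mix-order variant is not specified in the abstract and is not attempted here.

```python
import numpy as np

def localized_soft_assignment(x, codebook, beta=10.0, k=5):
    """Code one descriptor x (shape (D,)) against a codebook (shape (K, D)).

    Assumed form: softmax over negative squared distances, restricted to
    the k nearest visual words; all other entries of the code are zero.
    """
    d2 = np.sum((codebook - x) ** 2, axis=1)   # squared distance to every word
    nearest = np.argsort(d2)[:k]               # indices of the k nearest words
    # Shift by the smallest distance before exponentiating for numerical
    # stability; the shift cancels in the normalization below.
    w = np.exp(-beta * (d2[nearest] - d2[nearest].min()))
    code = np.zeros(codebook.shape[0])
    code[nearest] = w / w.sum()                # memberships sum to one
    return code

def max_pool(codes):
    """Pool per-descriptor codes (shape (N, K)) into one K-dim image vector."""
    return codes.max(axis=0)

# Toy example: four 2-D visual words and two descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 0.1]])
codes = np.stack([localized_soft_assignment(x, codebook, beta=1.0, k=2)
                  for x in descriptors])
pooled = max_pool(codes)
```

In this sketch, setting `k` equal to the codebook size recovers ordinary (non-localized) soft assignment, so the two variants can be compared by changing a single parameter.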
