Bag-of-multimedia-words for image classification

We introduce the bag-of-multimedia-words model that tightly combines the heterogeneous information coming from the text and the pixel-based information of a multimedia document. The proposed multimedia feature generation process is generic for any multi-modality and aims at enriching a multimedia document description with compact and discriminative signatures well appropriate to linear classifiers. It is evaluated on the Pascal VOC 2007 classification challenge, outperforming the state-of-the-art bag-of-visual-words or bag-of-tag-words based classification approaches.

[1]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[2]  Chong-Wah Ngo,et al.  Keyframe Retrieval by Keypoints: Can Point-to-Point Matching Help? , 2006, CIVR.

[3]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.

[4]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[6]  David H. Wolpert,et al.  Stacked generalization , 1992, Neural Networks.

[7]  Derek Hoiem,et al.  Building text features for object image classification , 2009, CVPR.

[8]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[9]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[10]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Daniel Gatica-Perez,et al.  PLSA-based image auto-annotation: constraining the latent space , 2004, MULTIMEDIA '04.

[12]  Giovanni Maria Farinella,et al.  Exploiting Textons Distributions on Spatial Hierarchy for Scene Classification , 2010, EURASIP J. Image Video Process..

[13]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Cordelia Schmid,et al.  Multimodal semi-supervised learning for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Shih-Fu Chang,et al.  To search or to label?: predicting the performance of search-based automatic image classifiers , 2006, MIR '06.

[17]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[18]  Jean Ponce,et al.  Learning mid-level features for recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[20]  Roelof van Zwol,et al.  Flickr tag recommendation based on collective knowledge , 2008, WWW.

[21]  Tieniu Tan,et al.  Salient coding for image classification , 2011, CVPR 2011.