Semantic-enriched visual vocabulary construction in a weakly supervised context

One of the prevalent learning tasks involving images is content-based image classification. This is a difficult task especially because the low-level features used to digitally describe images usually capture little information about the semantics of the images. In this paper, we tackle this difficulty by enriching the semantic content of the image representation by using external knowledge. The underlying hypothesis of our work is that creating a more semantically rich representation for images would yield higher machine learning performances, without the need to modify the learning algorithms themselves. The external semantic information is presented under the form of non-positional image labels, therefore positioning our work in a weakly supervised context. Two approaches are proposed: the first one leverages the labels into the visual vocabulary construction algorithm, the result being dedicated visual vocabularies. The second approach adds a filtering phase as a pre-processing of the vocabulary construction. Known positive and known negative sets are constructed and features that are unlikely to be associated with the objects denoted by the labels are filtered. We apply our proposition to the task of content-based image classification and we show that semantically enriching the image representation yields higher classification performances than the baseline representation.

[1]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[2]  Michael Brady,et al.  Saliency, Scale and Image Description , 2001, International Journal of Computer Vision.

[3]  Michael J. Swain,et al.  Color indexing , 1991, International Journal of Computer Vision.

[4]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[5]  Jianjia Zhang,et al.  Combined Category Visual Vocabulary: A new approach to visual vocabulary construction , 2011, 2011 4th International Congress on Image and Signal Processing.

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[8]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[9]  Cordelia Schmid,et al.  A sparse texture representation using affine-invariant regions , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[10]  Cordelia Schmid,et al.  Affine-invariant local descriptors and neighborhood statistics for texture recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[11]  Randal C. Nelson,et al.  Semi-Supervised Learning of Visual Classifiers from Web Images and Text , 2009, IJCAI.

[12]  Wen Gao,et al.  Towards semantic embedding in visual vocabulary , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[14]  Adam Meyerson,et al.  Online facility location , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[15]  Frédéric Jurie,et al.  Sampling Strategies for Bag-of-Features Image Classification , 2006, ECCV.

[16]  Frédéric Jurie,et al.  Creating efficient codebooks for visual recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[17]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[18]  Bruce A. Draper,et al.  Introduction to the Bag of Features Paradigm for Image Classification and Retrieval , 2011, ArXiv.

[19]  Patrick Gros,et al.  Factorial Correspondence Analysis for image retrieval , 2008, 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies.

[20]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[21]  Yiannis Kompatsiaris,et al.  Using a Multimedia Ontology Infrastructure for Semantic Annotation of Multimedia Content , 2005, SemAnnot@ISWC.

[22]  Gang Wang,et al.  Building text features for object image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[24]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[25]  Mubarak Shah,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Shih-Fu Chang,et al.  Visual Cue Cluster Construction via Information Bottleneck Principle and Kernel Density Estimation , 2005, CIVR.

[27]  Stéphane Lallich,et al.  Unsupervised feature construction for improving data representation and semantics , 2013, Journal of Intelligent Information Systems.

[28]  A. de Medeiros Martins,et al.  A new method for multi-texture segmentation using neural networks , 2002 .

[29]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[30]  Tony Lindeberg,et al.  Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention , 1993, International Journal of Computer Vision.

[31]  Raphaël Marée,et al.  Random subwindows for robust image classification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[32]  Thomas G. Dietterich,et al.  Learning non-redundant codebooks for classifying complex objects , 2009, ICML '09.

[33]  Tinne Tuytelaars,et al.  Towards a more discriminative and semantic visual vocabulary , 2011, Comput. Vis. Image Underst..

[34]  Cordelia Schmid,et al.  Spatial Weighting for Bag-of-Features , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[35]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[36]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[37]  Svetlana Lazebnik,et al.  Supervised Learning of Quantizer Codebooks by Information Loss Minimization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[39]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[40]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[41]  Andrew Zisserman,et al.  Texture classification: are filter banks necessary? , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[42]  Bernt Schiele,et al.  International Journal of Computer Vision manuscript No. (will be inserted by the editor) Semantic Modeling of Natural Scenes for Content-Based Image Retrieval , 2022 .

[43]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[44]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[45]  Henrik I. Christensen,et al.  Computational visual attention systems and their cognitive foundations: A survey , 2010, TAP.

[46]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[47]  Gabriela Csurka,et al.  Adapted Vocabularies for Generic Visual Categorization , 2006, ECCV.

[48]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[49]  Lixin Fan,et al.  Categorizing Nine Visual Classes using Local Appearance Descriptors , 2004 .

[50]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[51]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[52]  Kristen Grauman,et al.  Watch, Listen & Learn: Co-training on Captioned Images and Videos , 2008, ECML/PKDD.

[53]  G. Camara-Chavez,et al.  An interactive video content-based retrieval system , 2008, 2008 15th International Conference on Systems, Signals and Image Processing.

[54]  Joni-Kristian Kämäräinen,et al.  Making Visual Object Categorization More Challenging: Randomized Caltech-101 Data Set , 2010, 2010 20th International Conference on Pattern Recognition.

[55]  Stefano Soatto,et al.  Localizing Objects with Smart Dictionaries , 2008, ECCV.

[56]  R. Haralick,et al.  Computer Classification of Reservoir Sandstones , 1973 .

[57]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[58]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[59]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .