A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.

[1]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[2]  M. Bartlett Further aspects of the theory of multiple regression , 1938, Mathematical Proceedings of the Cambridge Philosophical Society.

[3]  Michael J. Swain,et al.  WebSeer: An Image Search Engine for the World Wide Web , 1996 .

[4]  Bernhard Schölkopf,et al.  Kernel Principal Component Analysis , 1997, ICANN.

[5]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  R. Larsen Lanczos Bidiagonalization With Partial Reorthogonalization , 1998 .

[7]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  David A. Forsyth,et al.  Learning the semantics of words and pictures , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[9]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[10]  Nello Cristianini,et al.  Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis , 2002, NIPS.

[11]  Bernhard Schölkopf,et al.  Cluster Kernels for Semi-Supervised Learning , 2002, NIPS.

[12]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[13]  R. Manmatha,et al.  A Model for Learning the Semantics of Pictures , 2003, NIPS.

[14]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[15]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[16]  Michael I. Jordan,et al.  Kernel independent component analysis , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[17]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[18]  Daniel Gatica-Perez,et al.  PLSA-based image auto-annotation: constraining the latent space , 2004, MULTIMEDIA '04.

[19]  Geoffrey E. Hinton,et al.  Neighbourhood Components Analysis , 2004, NIPS.

[20]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[21]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[22]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[23]  Laura A. Dabbish,et al.  Labeling images with a computer game , 2004, AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.

[24]  Yihong Gong,et al.  Multi-labelled classification using maximum entropy method , 2005, SIGIR '05.

[25]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[26]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[27]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[28]  Amir Globerson,et al.  Metric Learning by Collapsing Classes , 2005, NIPS.

[29]  Paul Clough,et al.  The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems , 2006 .

[30]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[31]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[32]  David A. Forsyth,et al.  Animals on the Web , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[33]  Nuno Vasconcelos,et al.  Bridging the Gap: Query by Semantic Example , 2007, IEEE Transactions on Multimedia.

[34]  Trevor Darrell,et al.  Learning Visual Representations using Images with Captions , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[36]  David Grangier,et al.  A Discriminative Kernel-based Model to Rank Images from Text Queries , 2007 .

[37]  Antonio Criminisi,et al.  Harvesting Image Databases from the Web , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[38]  Gustavo Carneiro,et al.  Supervised Learning of Semantic Classes for Image Annotation and Retrieval , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  James Ze Wang,et al.  Real-Time Computerized Annotation of Pictures , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[41]  Svetlana Lazebnik,et al.  Computing iconic summaries of general visual concepts , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[42]  Samy Bengio,et al.  A Discriminative Kernel-Based Approach to Rank Images from Text Queries , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Christoph H. Lampert,et al.  Correlational spectral clustering , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Different Scenes , 2008, ECCV.

[45]  Claudio Gutierrez,et al.  Survey of graph database models , 2008, CSUR.

[46]  Vladimir Pavlovic,et al.  A New Baseline for Image Annotation , 2008, ECCV.

[47]  Wei-Ying Ma,et al.  Annotating Images by Mining Image Search Results , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Subhransu Maji,et al.  Max-margin additive classifiers for detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[49]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[50]  Ivor W. Tsang,et al.  Using large-scale web data to facilitate textual query based retrieval of consumer photos , 2009, MM '09.

[51]  Alexander C. Berg,et al.  Finding iconic images , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[52]  Derek Hoiem,et al.  Building text features for object image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[54]  Chong Wang,et al.  Simultaneous image classification and annotation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Cordelia Schmid,et al.  TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[56]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[57]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[58]  Gang Wang,et al.  Learning image similarity from Flickr groups using Stochastic Intersection Kernel MAchines , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[59]  John Langford,et al.  Multi-Label Prediction via Compressed Sensing , 2009, NIPS.

[60]  Hal Daumé,et al.  Multi-Label Prediction via Sparse Infinite CCA , 2009, NIPS.

[61]  Vasant Honavar,et al.  Multiple label prediction for image annotation with multiple Kernel correlation models , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[62]  Jianping Fan,et al.  Harvesting large-scale weakly-tagged image databases from the web , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[63]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Frédéric Jurie,et al.  Improving web image search results using query-relative classifiers , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[65]  Florent Perronnin,et al.  Large-scale image categorization with explicit data embedding , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[66]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[67]  Kristen Grauman,et al.  Accounting for the Relative Importance of Objects in Image Retrieval , 2010, BMVC.

[68]  Mitesh M. Khapra,et al.  Improving the Multilingual User Experience of Wikipedia Using Cross-Language Name Search , 2010, NAACL.

[69]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[70]  Svetlana Lazebnik,et al.  Superparsing , 2010, International Journal of Computer Vision.

[71]  Andrew Zisserman,et al.  Efficient additive kernels via explicit feature maps , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[72]  Cordelia Schmid,et al.  Multimodal semi-supervised learning for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[73]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[74]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[75]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[76]  Qiang Chen,et al.  Multi-label visual classification with label exclusive context , 2011, 2011 International Conference on Computer Vision.

[77]  Jeff G. Schneider,et al.  Multi-Label Output Codes using Canonical Correlation Analysis , 2011, AISTATS.

[78]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[79]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[80]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[81]  Christoph H. Lampert,et al.  Learning Multi-View Neighborhood Preserving Projections , 2011, ICML.

[82]  Kristen Grauman,et al.  Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search , 2011, International Journal of Computer Vision.

[83]  Fuchun Sun,et al.  Large-Margin Predictive Latent Subspace Learning for Multiview Data Analysis , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[84]  Ernest Valveny,et al.  Leveraging category-level labels for instance-level image retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[85]  Gabriela Csurka,et al.  Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost , 2012, ECCV.

[86]  David W. Jacobs,et al.  Generalized Multiview Analysis: A discriminative latent space , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[87]  Jason Weston,et al.  Joint Image and Word Sense Discrimination for Image Retrieval , 2012, ECCV.

[88]  C. V. Jawahar,et al.  Image Annotation Using Metric Learning in Semantic Neighbourhoods , 2012, ECCV.

[89]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.