Automatic Visual Theme Discovery from Joint Image and Text Corpora

This paper presents an unsupervised visual theme discovery framework as a more compact and effective alternative for the semantic representation of visual content. First, a tag filtering algorithm is proposed that selects tags according to their ability to describe visual content. A spectral clustering algorithm is then applied to group the remaining tags into visual themes based on visual similarity and semantic similarity measures. User studies conducted to evaluate the effectiveness and rationality of the discovered visual themes yield promising results. Additionally, two common computer vision tasks, example-based image search and keyword-based image search, are used to explore potential applications of the proposed framework. The experimental results show that visual themes significantly outperform tags on semantic image understanding and achieve state-of-the-art performance in these two tasks.
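The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear blend of the two similarity matrices, the weight `alpha`, the toy tags and similarity values, and the function names are all assumptions made for illustration. The sketch uses normalized spectral clustering (symmetric normalized Laplacian, row-normalized eigenvector embedding, then a small k-means with farthest-point initialization).

```python
import numpy as np


def spectral_embedding(affinity, n_clusters):
    """Embed each tag using the eigenvectors of the normalized Laplacian."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vectors = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    embedding = vectors[:, :n_clusters]     # keep the n_clusters smallest
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)


def kmeans(points, n_clusters, n_iter=50):
    """Small deterministic k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dists, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels


def discover_themes(tags, visual_sim, semantic_sim, n_themes, alpha=0.5):
    """Group tags into themes from two tag-by-tag similarity matrices.

    alpha is an assumed blending weight, not a value taken from the paper.
    """
    affinity = alpha * np.asarray(visual_sim) + (1.0 - alpha) * np.asarray(semantic_sim)
    labels = kmeans(spectral_embedding(affinity, n_themes), n_themes)
    themes = {}
    for tag, label in zip(tags, labels):
        themes.setdefault(int(label), []).append(tag)
    return list(themes.values())


# Toy example with made-up similarity values.
tags = ["beach", "sea", "car", "road"]
visual = [[1.0, 0.9, 0.1, 0.2],
          [0.9, 1.0, 0.2, 0.1],
          [0.1, 0.2, 1.0, 0.8],
          [0.2, 0.1, 0.8, 1.0]]
semantic = [[1.0, 0.8, 0.1, 0.1],
            [0.8, 1.0, 0.1, 0.2],
            [0.1, 0.1, 1.0, 0.9],
            [0.1, 0.2, 0.9, 1.0]]
print(discover_themes(tags, visual, semantic, n_themes=2))
```

In this toy setting, tags that are similar both visually and semantically end up in the same theme. How the two similarity measures are actually computed and combined is specified in the paper body, not in this sketch.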
