VISIR: Visual and Semantic Image Label Refinement

The social media explosion has populated the Internet with a wealth of images. There are two existing paradigms for image retrieval: 1)content-based image retrieval (BIR), which has traditionally used visual features for similarity search (e.g., SIFT features), and 2) tag-based image retrieval (TBIR), which has relied on user tagging (e.g., Flickr tags). CBIR now gains semantic expressiveness by advances in deep-learning-based detection of visual labels. TBIR benefits from query-and-click logs to automatically infer more informative labels. However, learning-based tagging still yields noisy labels and is restricted to concrete objects, missing out on generalizations and abstractions. Click-based tagging is limited to terms that appear in the textual context of an image or in queries that lead to a click. This paper addresses the above limitations by semantically refining and expanding the labels suggested by learning-based object detection. We consider the semantic coherence between the labels for different objects, leverage lexical and commonsense knowledge, and cast the label assignment into a constrained optimization problem solved by an integer linear program. Experiments show that our method, called VISIR, improves the quality of the state-of-the-art visual labeling tools like LSDA and YOLO.

[1]  Cristian Sminchisescu,et al.  Object Recognition by Sequential Figure-Ground Ranking , 2011, International Journal of Computer Vision.

[2]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3]  Lei Wu,et al.  Tag Completion for Image Retrieval , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  BertiniMarco,et al.  Socializing the Semantic Gap , 2016 .

[5]  Erik T. Mueller,et al.  Open Mind Common Sense: Knowledge Acquisition from the General Public , 2002, OTM.

[6]  Nick Craswell,et al.  Random walks on the click graph , 2007, SIGIR.

[7]  Dong Liu,et al.  Image Retagging Using Collaborative Tag Propagation , 2011, IEEE Transactions on Multimedia.

[8]  Nicu Sebe,et al.  Content-based multimedia information retrieval: State of the art and challenges , 2006, TOMCCAP.

[9]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[10]  Àgata Lapedriza,et al.  Emotion Recognition in Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Henry Lieberman,et al.  Robust Photo Retrieval Using World Semantics , 2002 .

[12]  Sanja Fidler,et al.  Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Dong Liu,et al.  Retagging social images based on visual and semantic consistency , 2010, WWW '10.

[14]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[15]  Marcel Worring,et al.  Learning Social Tag Relevance by Neighbor Voting , 2009, IEEE Transactions on Multimedia.

[16]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Alberto Del Bimbo,et al.  Socializing the Semantic Gap , 2015, ACM Comput. Surv..

[19]  Andreas Hotho,et al.  Social Tagging Recommender Systems , 2011, Recommender Systems Handbook.

[20]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[21]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Trevor Darrell,et al.  LSDA: Large Scale Detection through Adaptation , 2014, NIPS.

[23]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[24]  Gerhard Weikum,et al.  Know2Look: Commonsense Knowledge for Visual Search , 2016, AKBC@NAACL-HLT.

[25]  Hugo Liu,et al.  ConceptNet — A Practical Commonsense Reasoning Tool-Kit , 2004 .

[26]  Gerhard Weikum,et al.  WebChild: harvesting and organizing commonsense knowledge from the web , 2014, WSDM.

[27]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Samy Bengio,et al.  Learning semantic relationships for better action retrieval in images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Valentin Robu,et al.  The complex dynamics of collaborative tagging , 2007, WWW '07.

[30]  Hao Xu,et al.  Tag refinement by regularized LDA , 2009, ACM Multimedia.

[31]  Wei Xu,et al.  CNN-RNN: A Unified Framework for Multi-label Image Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Catherine Havasi,et al.  Representing General Relational Knowledge in ConceptNet 5 , 2012, LREC.

[34]  Andrew Zisserman,et al.  Three things everyone should know to improve object retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[36]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[37]  Li Fei-Fei,et al.  Crowdsourcing in Computer Vision , 2016, Found. Trends Comput. Graph. Vis..

[38]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[39]  Rong Jin,et al.  Image Tag Completion by Noisy Matrix Recovery , 2014, ECCV.

[40]  Xian-Sheng Hua,et al.  Tell me what , 2014, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[41]  Yueting Zhuang,et al.  Learning of Multimodal Representations With Random Walks on the Click Graph , 2016, IEEE Transactions on Image Processing.

[42]  Tao Chen,et al.  Object-Based Visual Sentiment Concept Analysis and Application , 2014, ACM Multimedia.

[43]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[45]  Cordelia Schmid,et al.  TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[46]  Fatos T. Yarman-Vural,et al.  Automatic Image Annotation by Ensemble of Visual Descriptors , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Xuelong Li,et al.  Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search , 2013, IEEE Transactions on Image Processing.

[49]  Cyrus Rashtchian,et al.  Collecting Image Annotations Using Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.

[50]  Shuicheng Yan,et al.  Image tag refinement towards low-rank, content-tag prior and error sparsity , 2010, ACM Multimedia.

[51]  Dumitru Erhan,et al.  Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.