Paper Information - VISIR: Visual and Semantic Image Label Refinement

VISIR: Visual and Semantic Image Label Refinement

The social media explosion has populated the Internet with a wealth of images. There are two existing paradigms for image retrieval: 1) content-based image retrieval (CBIR), which has traditionally used visual features for similarity search (e.g., SIFT features), and 2) tag-based image retrieval (TBIR), which has relied on user tagging (e.g., Flickr tags). CBIR now gains semantic expressiveness through advances in deep-learning-based detection of visual labels. TBIR benefits from query-and-click logs to automatically infer more informative labels. However, learning-based tagging still yields noisy labels and is restricted to concrete objects, missing out on generalizations and abstractions. Click-based tagging is limited to terms that appear in the textual context of an image or in queries that lead to a click. This paper addresses the above limitations by semantically refining and expanding the labels suggested by learning-based object detection. We consider the semantic coherence between the labels for different objects, leverage lexical and commonsense knowledge, and cast the label assignment into a constrained optimization problem solved by an integer linear program. Experiments show that our method, called VISIR, improves the label quality of state-of-the-art visual labeling tools such as LSDA and YOLO.
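To make the idea of label assignment as constrained optimization concrete, here is a minimal sketch in Python. It is not the paper's formulation: the abstract gives neither the objective nor the constraints, so the candidate labels, confidence values, coherence scores, the weight `alpha`, and the function name `refine_labels` are all hypothetical. The sketch also enumerates every feasible 0/1 assignment by brute force instead of calling an ILP solver, which is only workable for toy inputs. A real ILP would additionally linearize the pairwise term with auxiliary variables.

```python
from itertools import combinations, product


def refine_labels(candidates, coherence, alpha=0.5):
    """Pick one label per box, maximizing confidence plus pairwise coherence.

    candidates: {box: {label: detector_confidence}}   (hypothetical input)
    coherence:  {frozenset({label_a, label_b}): score} (hypothetical input)
    alpha:      assumed trade-off weight for the coherence term
    """
    boxes = sorted(candidates)
    best_score, best_assignment = float("-inf"), None
    # Constraint: exactly one label is selected for each bounding box.
    for choice in product(*(sorted(candidates[b]) for b in boxes)):
        # Unary term: detector confidence of each selected label.
        score = sum(candidates[b][lab] for b, lab in zip(boxes, choice))
        # Pairwise term: semantic coherence between co-selected labels.
        for a, b in combinations(choice, 2):
            score += alpha * coherence.get(frozenset((a, b)), 0.0)
        if score > best_score:
            best_score, best_assignment = score, dict(zip(boxes, choice))
    return best_assignment, best_score


# Toy data, invented for illustration only.
candidates = {
    "box1": {"racket": 0.55, "pan": 0.60},
    "box2": {"tennis ball": 0.80},
}
coherence = {
    frozenset(("racket", "tennis ball")): 0.9,
    frozenset(("pan", "tennis ball")): 0.1,
}

assignment, score = refine_labels(candidates, coherence)
print(assignment, round(score, 2))
```

In this toy input the detector slightly prefers "pan" for the first box, but the coherence with "tennis ball" tips the joint assignment toward "racket". This is the kind of correction that considering coherence between labels is meant to enable.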
