Tag refinement of micro-videos by learning from multiple data sources

Micro-video is an increasingly prevalent form of social media that attracts attention for its ease of capture and expressiveness. However, because the user-generated hashtags of micro-videos are severely imbalanced in distribution and low in quality, managing micro-videos is challenging. In this paper, we propose a novel tag refinement approach for micro-videos that learns from multiple public data sources with manually labelled tags. This avoids the difficulty of directly refining imprecise hashtags and addresses the lack of manually labelled micro-video datasets for training. We define a set of target tags by referring to widely used datasets for object, activity and scene detection. For tag refinement, we first transfer tags from the images in NUS-WIDE to the micro-video keyframes by similarity measurement. Meanwhile, we complement the tags by detecting the objects, activities and scenes in micro-videos from appearance and motion features, with the assistance of the ImageNet, PASCAL VOC, HMDB51, UCF50 and SUN datasets. We also denoise the hashtags by constructing mappings between hashtags and target tags based on statistics from NUS-WIDE. The results of tag transfer, complement and denoising are finally linearly combined to generate the refined tags of each micro-video. To validate the approach, we construct a dataset of 600 micro-videos from Vine and manually label them with target tags. The experimental results show that our approach performs well in tag refinement of micro-videos by learning from multiple data sources.
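The two steps the abstract describes most concretely, similarity-based tag transfer and the final linear combination, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the use of cosine similarity over toy feature vectors, the choice of k nearest images, and the combination weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def transfer_tags(keyframe, labelled_images, target_tags, k=2):
    """Score target tags by a similarity-weighted vote of the k most
    similar labelled images (hypothetical stand-in for tag transfer)."""
    scored = [(max(cosine(keyframe, feat), 0.0), tags)
              for feat, tags in labelled_images]
    nearest = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    scores = {tag: 0.0 for tag in target_tags}
    total = sum(sim for sim, _ in nearest)
    for sim, tags in nearest:
        for tag in tags:
            if tag in scores:
                scores[tag] += sim
    if total > 0:
        # Normalise so every score lies in [0, 1].
        scores = {tag: value / total for tag, value in scores.items()}
    return scores


def combine(transfer, complement, denoise, weights=(0.5, 0.3, 0.2)):
    """Linearly combine the three per-tag score dictionaries.
    The weights are illustrative assumptions, not values from the paper."""
    w_t, w_c, w_d = weights
    all_tags = set(transfer) | set(complement) | set(denoise)
    return {tag: w_t * transfer.get(tag, 0.0)
                 + w_c * complement.get(tag, 0.0)
                 + w_d * denoise.get(tag, 0.0)
            for tag in all_tags}


# Toy data: 2-D features standing in for real keyframe descriptors.
target_tags = ["dog", "beach", "running"]
labelled_images = [
    ([1.0, 0.0], {"dog"}),
    ([0.0, 1.0], {"beach"}),
    ([1.0, 0.1], {"dog", "running"}),
]
transfer = transfer_tags([1.0, 0.0], labelled_images, target_tags)
complement = {"dog": 0.5, "beach": 0.5, "running": 0.0}  # assumed detector scores
denoise = {"dog": 1.0}                                   # assumed hashtag mapping scores
refined = combine(transfer, complement, denoise)
ranking = sorted(refined, key=refined.get, reverse=True)
```

In this sketch a tag's final score is a weighted sum of its three component scores, so a tag supported by only one source can still be ranked, while a tag supported by all three rises to the top. How the weights would actually be chosen is not stated in the abstract.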
