Video-to-Shot Tag Propagation by Graph Sparse Group Lasso

Traditional approaches to video tagging propagate tags at a single level: they either generate tags for a test video from training videos annotated at the video level, or assign tags to a test shot from a collection of annotated shots. This paper instead focuses on automatic shot tagging given a collection of videos tagged only at the video level, i.e., video-to-shot (V2S) tag propagation. In other words, we aim to assign specific tags from the training videos to the test shot. The paper solves the V2S problem by assigning to the test shot a subset of the tags carried by a subset of the training videos. To achieve this goal, the paper first proposes a novel Graph Sparse Group Lasso (GSGL) model that linearly reconstructs the visual feature of the test shot from the visual features of the training videos, thereby learning the correlation between the test shot and the training videos. The paper then proposes a new tag propagation rule that assigns the video-level tags to the test shot according to the learnt correlations. Moreover, to build the reconstruction model effectively, the proposed GSGL simultaneously takes several constraints into account: inter-group sparsity, intra-group sparsity, temporal-spatial prior knowledge in the training videos, and the local structure of the test shot. Extensive experiments on public video datasets demonstrate the effectiveness of the proposed method for video-to-shot tag propagation.
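The reconstruct-then-propagate idea described above can be illustrated with a minimal sketch. It is not the paper's GSGL formulation: the abstract does not give the objective, so the code assumes a plain sparse group lasso (an l1 term for intra-group sparsity plus a per-video l2 term for inter-group sparsity) solved by proximal gradient descent, and it omits the graph, temporal-spatial, and local-structure terms. The function names, the parameters `lam1` and `lam2`, and the scoring rule in `propagate_tags` are all hypothetical choices made for illustration.

```python
import numpy as np


def sparse_group_lasso(x, D, groups, lam1=0.1, lam2=0.1, n_iter=500):
    """Minimise 0.5*||x - D w||^2 + lam1*||w||_1 + lam2*sum_g ||w_g||_2.

    x      : feature vector of the test shot, shape (d,)
    D      : training shot features as columns, shape (d, n)
    groups : video index of each column of D, shape (n,)
    Assumed objective (plain sparse group lasso), not the paper's GSGL.
    """
    groups = np.asarray(groups)
    w = np.zeros(D.shape[1])
    # Step size = 1 / Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    for _ in range(n_iter):
        # Gradient step on the reconstruction error.
        z = w - step * (D.T @ (D @ w - x))
        # Element-wise soft-thresholding (intra-group sparsity).
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)
        # Group-wise shrinkage (inter-group sparsity).
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            scale = max(0.0, 1.0 - step * lam2 / norm) if norm > 0 else 0.0
            z[idx] = scale * z[idx]
        w = z
    return w


def propagate_tags(w, groups, video_tags, n_tags):
    """Hypothetical propagation rule: each video votes for its own tags,
    weighted by the total absolute reconstruction weight of its shots."""
    groups = np.asarray(groups)
    scores = np.zeros(n_tags)
    for g, tags in enumerate(video_tags):
        weight = np.abs(w[groups == g]).sum()
        for t in tags:
            scores[t] += weight
    return scores
```

In this sketch, a training video whose shots receive zero weight contributes nothing to the tag scores of the test shot, which mirrors the idea of drawing tags from only a subset of the training videos.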
