Exploiting Web Images for Semantic Video Indexing Via Robust Sample-Specific Loss

Semantic video indexing, also known as video annotation or video concept detection in the literature, has attracted significant attention in recent years. Because labeled training videos are scarce, most existing approaches struggle to achieve satisfactory performance. In this paper, we propose a novel semantic video indexing approach that exploits abundant user-tagged Web images to help learn robust semantic video indexing classifiers. We address two major challenges: 1) noisy Web images with imprecise and/or incomplete tags; and 2) the domain difference between images and videos. Specifically, we first apply a non-parametric approach to estimate the probability that each image is correctly tagged, and use this probability as a confidence score. We then develop a robust transfer video indexing (RTVI) model to learn reliable classifiers from a limited number of training videos together with a large number of user-tagged images. The RTVI model is equipped with a novel sample-specific robust loss function, which uses the confidence score of a Web image as prior knowledge to control, and where necessary suppress, the contribution of that image to the learning process. At the same time, the RTVI model discovers an optimal kernel space in which the mismatch between images and videos is minimized, thereby tackling the domain difference problem. We also devise an iterative algorithm to optimize the proposed RTVI model and provide a theoretical analysis of its convergence. Extensive experiments on various real-world multimedia collections demonstrate the effectiveness of the proposed robust semantic video indexing approach.
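To make the idea of a confidence-weighted loss concrete, the following is a minimal sketch and not the RTVI formulation itself, which the abstract does not spell out. It assumes two stand-in techniques: a k-nearest-neighbour tag vote as the non-parametric confidence estimator, and a hinge loss scaled per sample by that confidence. The function names `knn_confidence` and `weighted_hinge_loss`, the toy features, and the linear classifier parameters are all hypothetical and chosen only for illustration.

```python
import math


def knn_confidence(features, tags, k=2):
    """Assumed estimator: the confidence of an image is the fraction of its
    k nearest neighbours (Euclidean distance) that carry the same tag."""
    scores = []
    for i, (x, t) in enumerate(zip(features, tags)):
        # Distances to every other image, nearest first.
        dists = sorted(
            (math.dist(x, y), j) for j, y in enumerate(features) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        agree = sum(1 for j in neighbours if tags[j] == t)
        scores.append(agree / len(neighbours))
    return scores


def weighted_hinge_loss(w, b, features, labels, weights):
    """Assumed loss: a hinge loss in which each sample's term is scaled by
    its own weight, so low-confidence images contribute less."""
    total = 0.0
    for x, y, c in zip(features, labels, weights):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += c * max(0.0, 1.0 - y * score)
    return total


# Toy data (hypothetical): two clean clusters plus one mis-tagged image.
features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
tags = [1, 1, 1, -1, -1, -1, -1]  # the last image sits in the +1 cluster

confidence = knn_confidence(features, tags, k=2)

# A hypothetical linear classifier that separates the two clusters.
w, b = (-0.5, -0.5), 2.5
plain = weighted_hinge_loss(w, b, features, tags, [1.0] * len(tags))
robust = weighted_hinge_loss(w, b, features, tags, confidence)
```

In this toy setting the mis-tagged image receives the lowest confidence, so its hinge term, which dominates the unweighted loss, is removed from the weighted one. A full model would additionally need the kernel-space alignment between images and videos described above, which this sketch leaves out.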
