Web Video Event Recognition by Semantic Analysis From Ubiquitous Documents

In recent years, event recognition from videos has attracted increasing interest in the multimedia area. Most existing research has focused on exploring visual cues to handle relatively small-granular events, yet it is difficult to analyze video content directly without any prior knowledge. Combining visual and semantic analysis is therefore a natural approach to video event understanding. In this paper, we study the problem of Web video event recognition, where Web videos often describe large-granular events and carry limited textual information. Key challenges include how to accurately represent event semantics from incomplete textual information and how to effectively exploit the correlation between visual and textual cues for video event understanding. We propose a novel framework to perform complex event recognition from Web videos. To compensate for the insufficient expressive power of visual cues, we construct an event knowledge base by deeply mining semantic information from ubiquitous Web documents. This event knowledge base is capable of describing each event with comprehensive semantics. By utilizing this knowledge base, the textual cues for a video can be significantly enriched. Furthermore, we introduce a two-view adaptive regression model, which explores the intrinsic correlation between the visual and textual cues of the videos to learn reliable classifiers. Extensive experiments on two real-world video data sets show the effectiveness of our proposed framework and demonstrate that the event knowledge base indeed helps improve the performance of Web video event recognition.
