Three-Dimensional Attention-Based Deep Ranking Model for Video Highlight Detection

The video highlight detection task is to localize key elements (moments of user's major or special interest) in a video. Most of existing highlight detection approaches extract features from the video segment as a whole without considering the difference of local features both temporally and spatially. Due to the complexity of video content, this kind of mixed features will impact the final highlight prediction. In temporal extent, not all frames are worth watching because some of them only contain the background of the environment without human or other moving objects. In spatial extent, it is similar that not all regions in each frame are highlights especially when there are lots of clutters in the background. To solve the above problem, we propose a novel three-dimensional (3-D) (spatial+temporal) attention model that can automatically localize the key elements in a video without any extra supervised annotations. Specifically, the proposed attention model produces attention weights of local regions along both the spatial and temporal dimensions of the video segment. The regions of key elements in the video will be strengthened with large weights. Thus, the more effective feature of the video segment is obtained to predict the highlight score. The proposed 3-D attention scheme can be easily integrated into a conventional end-to-end deep ranking model that aims to learn a deep neural network to compute the highlight score of each video segment. Extensive experimental results on the YouTube and SumMe datasets demonstrate that the proposed approach achieves significant improvement over state-of-the-art methods. With the proposed 3-D attention model, video highlights can be accurately retrieved in spatial and temporal dimensions without human supervision in several domains, such as gymnastics, parkour, skating, skiing, surfing, and dog activities, on the public datasets.

[1]  Qi Tian,et al.  Enhancing Person Re-identification in a Self-Trained Subspace , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[2]  Koray Kavukcuoglu,et al.  Visual Attention , 2020, Computational Models for Cognitive Vision.

[3]  Weibei Dou,et al.  Combining Short and Long Term Audio Features for TV Sports Highlight Detection , 2006, ECIR.

[4]  Wayne H. Wolf,et al.  Key frame selection by motion analysis , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[5]  Xiaochun Cao,et al.  Fashion Parsing With Video Context , 2015, IEEE Trans. Multim..

[6]  Changsheng Xu,et al.  Weakly Supervised Graph Propagation Towards Collective Image Parsing , 2012, IEEE Transactions on Multimedia.

[7]  Nicu Sebe,et al.  Feature Selection for Multimedia Analysis by Sharing Information Among Multiple Tasks , 2013, IEEE Transactions on Multimedia.

[8]  Phil Blunsom,et al.  Reasoning about Entailment with Neural Attention , 2015, ICLR.

[9]  M. Shamim Hossain,et al.  Automatic Visual Concept Learning for Social Event Understanding , 2015, IEEE Transactions on Multimedia.

[10]  Deyu Meng,et al.  Co-Saliency Detection via a Self-Paced Multiple-Instance Learning Framework , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Yelena Yesha,et al.  Keyframe-based video summarization using Delaunay clustering , 2006, International Journal on Digital Libraries.

[12]  Masaharu Ogawa,et al.  A highlight scene detection and video summarization system using audio feature for a personal video recorder , 2005, IEEE Transactions on Consumer Electronics.

[13]  Changsheng Xu,et al.  Boosted Exemplar Learning for Action Recognition and Annotation , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[14]  Minyi Guo,et al.  Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-Encoders , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Changsheng Xu,et al.  A Generic Framework for Video Annotation via Semi-Supervised Learning , 2012, IEEE Transactions on Multimedia.

[16]  Feiping Nie,et al.  Revisiting Co-Saliency Detection: A Novel Approach Based on Two-Stage Multi-View Spectral Rotation Co-clustering , 2017, IEEE Transactions on Image Processing.

[17]  Jason Weston,et al.  A Neural Attention Model for Abstractive Sentence Summarization , 2015, EMNLP.

[18]  Wen Gao,et al.  Human Behavior Analysis for Highlight Ranking in Broadcast Racket Sports Video , 2007, IEEE Transactions on Multimedia.

[19]  Vlad I. Morariu,et al.  Summarizing While Recording: Context-Based Highlight Detection for Egocentric Videos , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[20]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[21]  Meng Wang,et al.  Visual Classification by ℓ1-Hypergraph Modeling , 2015, IEEE Trans. Knowl. Data Eng..

[22]  Chong-Wah Ngo,et al.  Video summarization and scene detection by graph modeling , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[23]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[24]  Xiangjian He,et al.  Hierarchical affective content analysis in arousal and valence dimensions , 2013, Signal Process..

[25]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[26]  Meng Wang,et al.  Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification , 2012, IEEE Transactions on Multimedia.

[27]  Xiaojun Chang,et al.  Revealing Event Saliency in Unconstrained Video Collection , 2017, IEEE Transactions on Image Processing.

[28]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[29]  Dacheng Tao,et al.  Multi-View Object Retrieval via Multi-Scale Topic Models. , 2016, IEEE transactions on image processing : a publication of the IEEE Signal Processing Society.

[30]  Jiashi Feng,et al.  Zero-Shot Learning via Attribute Regression and Class Prototype Rectification. , 2018, IEEE transactions on image processing : a publication of the IEEE Signal Processing Society.

[31]  Surya Nepal,et al.  Automatic detection of 'Goal' segments in basketball videos , 2001, MULTIMEDIA '01.

[32]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[33]  M. Shamim Hossain,et al.  Deep Relative Attributes , 2016, IEEE Transactions on Multimedia.

[34]  Gang Hua,et al.  A Hierarchical Visual Model for Video Object Summarization , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Bu-Sung Lee,et al.  Event on demand with MPEG-21 video adaptation system , 2006, MM '06.

[36]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[37]  Constantine Kotropoulos,et al.  Video summarization based on shot boundary detection with penalized contrasts , 2015, 2015 9th International Symposium on Image and Signal Processing and Analysis (ISPA).

[38]  Yongkun Wang,et al.  An Improved Keyframe Extraction Method Based on HSV Colour Space , 2013, J. Softw..

[39]  Xiangjian He,et al.  A three-level framework for affective content analysis and its case studies , 2014, Multimedia Tools and Applications.

[40]  Changsheng Xu,et al.  Correlation Particle Filter for Visual Tracking , 2018, IEEE Transactions on Image Processing.

[41]  Adrian Ulges,et al.  Keyframe Extraction for Video Tagging & Summarization , 2008, Informatiktage.

[42]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[43]  Ali Farhadi,et al.  Ranking Domain-Specific Highlights by Analyzing Edited Videos , 2014, ECCV.

[44]  Shuicheng Yan,et al.  Fashion Parsing With Weak Color-Category Labels , 2014, IEEE Transactions on Multimedia.

[45]  Changsheng Xu,et al.  Cross-Domain Feature Learning in Multimedia , 2015, IEEE Transactions on Multimedia.

[46]  Xindong Wu,et al.  Learning on Big Graph: Label Inference and Regularization with Anchor Hierarchy , 2017, IEEE Transactions on Knowledge and Data Engineering.

[47]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Yifan Zhang,et al.  Highlight ranking for sports video browsing , 2005, MULTIMEDIA '05.

[49]  Stephen W. Smoliar,et al.  An integrated system for content-based video retrieval and browsing , 1997, Pattern Recognit..

[50]  Changsheng Xu,et al.  Learning Multi-Task Correlation Particle Filters for Visual Tracking , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[52]  Changsheng Xu,et al.  Cross-Domain Multi-Event Tracking via CO-PMHT , 2014, TOMM.

[53]  Meng Wang,et al.  Learning Visual Semantic Relationships for Efficient Visual Retrieval , 2015, IEEE Transactions on Big Data.

[54]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[55]  Yi Yang,et al.  A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Tao Mei,et al.  Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).