Multi-Level Spatiotemporal Network for Video Summarization

With the growing ubiquity of camera-equipped devices, video content is produced at scale across the industry. Automatic video summarization allows content consumers to efficiently retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. However, video fragments in different shots naturally carry richer semantics than individual frames. We leverage this as a free latent supervision signal and introduce a novel model named Multi-Level Spatiotemporal Network (MLSN). Our approach comprises Multi-Level Feature Representations (MLFR) and a Local Relative Loss (LRL). The MLFR module combines frame-level, fragment-level, and shot-level features with relative position encoding, so that for videos with different shot durations it can flexibly capture and accommodate semantic information at different spatiotemporal granularities. LRL exploits the partial ordering relations among the frames of each fragment to learn highly discriminative features and improve the sensitivity of the model. Our method improves on the best published method by 7% on LSVD, our industrial product dataset. Experimental results on two widely used benchmark datasets, SumMe and TVSum, further show that our method outperforms most state-of-the-art approaches.
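
To make the idea of a loss built on partial ordering relations within each fragment concrete, the following is a minimal sketch, not the formulation used by MLSN, which the abstract does not specify. It assumes a margin-based pairwise hinge loss; the function name `local_relative_loss`, the `margin` parameter, and the representation of fragments as `(start, end)` index ranges are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def local_relative_loss(
    pred: Sequence[float],
    target: Sequence[float],
    fragments: List[Tuple[int, int]],
    margin: float = 0.1,  # assumed hyperparameter, not taken from the paper
) -> float:
    """Mean pairwise hinge loss over ordered frame pairs within each fragment.

    For every pair (i, j) in the same fragment where the ground-truth score of
    frame i exceeds that of frame j, the predicted score of i is pushed above
    the predicted score of j by at least `margin`. Pairs that span different
    fragments are never compared, which is what makes the loss "local".
    """
    total, count = 0.0, 0
    for start, end in fragments:  # end is exclusive
        for i in range(start, end):
            for j in range(start, end):
                if target[i] > target[j]:
                    total += max(0.0, margin - (pred[i] - pred[j]))
                    count += 1
    return total / count if count else 0.0


# Two fragments of two frames each; ground truth marks frames 0 and 3 as important.
target = [1.0, 0.0, 0.0, 1.0]
fragments = [(0, 2), (2, 4)]

good = local_relative_loss([0.9, 0.1, 0.2, 0.8], target, fragments)
bad = local_relative_loss([0.1, 0.9, 0.8, 0.2], target, fragments)
```

In this sketch, predictions that respect the within-fragment ordering incur no loss, while predictions that invert it are penalized in proportion to how far they violate the margin. Restricting comparisons to frames of the same fragment is one plausible way to keep the ranking signal focused on locally comparable content.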
