Hierarchical Multimodal Transformer to Summarize Videos

Although video summarization has achieved considerable success with Recurrent Neural Networks (RNNs), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for addressing this problem and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the success of the transformer and the natural structure of video (frame-shot-video), we develop a hierarchical transformer for video summarization, which captures the dependencies among frames and shots and summarizes the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Extensive experiments show that HMT achieves (F-measure: 0.441, Kendall’s τ: 0.079, Spearman’s ρ: 0.080) on SumMe and (F-measure: 0.601, Kendall’s τ: 0.096, Spearman’s ρ: 0.107) on TVSum. It surpasses most traditional, RNN-based, and attention-based video summarization methods.
