AudioVisual Video Summarization

Audio and vision are the two main modalities in video data. Multimodal learning, and audiovisual learning in particular, has drawn considerable attention recently and can boost the performance of various computer vision tasks. In video summarization, however, most existing approaches exploit only the visual information and neglect the audio. In this brief, we argue that the audio modality can help the visual modality to better understand the video content and structure, and thereby benefit the summarization process. Motivated by this, we propose to jointly exploit audio and visual information for video summarization and develop an audiovisual recurrent network (AVRN) to do so. Specifically, the proposed AVRN consists of three parts: 1) a two-stream long short-term memory (LSTM) encodes the audio and visual features sequentially by capturing their temporal dependencies; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; and 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
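
The three-part structure described above can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the feature dimensions, the random untrained weights, the fusion by concatenating the two hidden-state streams, the choice to run self-attention over the visual hidden states, and the final sigmoid scoring layer are all assumptions made for illustration, and the names `LSTM`, `self_attention`, and `avrn_sketch` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTM:
    """Minimal single-layer LSTM, forward pass only, random untrained weights."""

    def __init__(self, in_dim, hid_dim, rng):
        # One stacked weight matrix for the input, forget, cell, and output gates.
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def __call__(self, seq):
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        outputs = []
        for x in seq:  # one step per frame
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(h)
        return np.stack(outputs)  # shape (T, hid_dim)


def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x  # each frame becomes a weighted mix of all frames


def avrn_sketch(visual, audio, hid_dim=16, seed=0):
    """Return one importance score per frame (hypothetical simplification)."""
    rng = np.random.default_rng(seed)
    # 1) Two-stream LSTM: encode each modality along the time axis.
    vis_h = LSTM(visual.shape[1], hid_dim, rng)(visual)
    aud_h = LSTM(audio.shape[1], hid_dim, rng)(audio)
    # 2) Fusion LSTM: assumed here to read the concatenated hidden states.
    fused = LSTM(2 * hid_dim, hid_dim, rng)(np.concatenate([vis_h, aud_h], axis=1))
    # 3) Self-attention encoder: assumed here to attend over the visual stream.
    global_ctx = self_attention(vis_h)
    # Prediction: assumed linear layer plus sigmoid over the joined features.
    w = rng.standard_normal(2 * hid_dim) * 0.1
    return sigmoid(np.concatenate([fused, global_ctx], axis=1) @ w)


# Toy input: 20 frames with 32-d visual and 8-d audio features (made-up sizes).
rng = np.random.default_rng(1)
visual = rng.standard_normal((20, 32))
audio = rng.standard_normal((20, 8))
scores = avrn_sketch(visual, audio)
print(scores.shape)  # (20,)
```

In a trained model the per-frame scores would typically be thresholded or passed to a segment-selection step to form the summary; the sketch stops at the scores because the text above does not specify that step.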
