TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization

Although recurrent neural networks (RNNs) have achieved tremendous advances in video summarization, some problems remain to be addressed. In this article, we focus on two intractable problems that arise when applying an RNN to video summarization. The first is the extremely large feature-to-hidden matrices: because video features usually lie in a high-dimensional space, the feature-to-hidden mapping matrices in the RNN model become extremely large, which increases the training difficulty. The second is the deficiency in exploring long-range temporal dependence: most videos contain at least thousands of frames, a sequence too long for traditional RNNs to handle well. To address these two problems, we develop a tensor-train hierarchical recurrent neural network (TTH-RNN) for the video summarization task. It contains a tensor-train embedding layer that avoids the large feature-to-hidden matrices, together with a hierarchical RNN structure that explores the long-range temporal dependence among video frames. Experimental results on four benchmark datasets, SumMe, TVSum, MED, and VTW, demonstrate the excellent performance of TTH-RNN in video summarization.
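
To give a rough sense of why a tensor-train factorization shrinks a feature-to-hidden mapping, the sketch below compares the parameter count of a dense matrix with that of a tensor-train (TT) matrix whose cores have shape r_{k-1} x m_k x n_k x r_k. The mode sizes, hidden size, and TT-ranks are illustrative assumptions, not values taken from the article, and the helper name tt_param_count is hypothetical.

```python
from math import prod


def tt_param_count(in_modes, out_modes, ranks):
    """Count parameters of a TT-matrix with cores of shape
    (ranks[k], in_modes[k], out_modes[k], ranks[k + 1])."""
    if len(in_modes) != len(out_modes):
        raise ValueError("in_modes and out_modes must have the same length")
    if len(ranks) != len(in_modes) + 1:
        raise ValueError("ranks must have one more entry than the modes")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT-ranks must both equal 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Assumed sizes for illustration only (not from the article):
# a 1024-dimensional frame feature mapped to a 256-dimensional hidden state.
in_modes = (4, 8, 8, 4)    # 4 * 8 * 8 * 4 = 1024
out_modes = (4, 4, 4, 4)   # 4 * 4 * 4 * 4 = 256
ranks = (1, 4, 4, 4, 1)    # assumed TT-ranks

dense_params = prod(in_modes) * prod(out_modes)
tt_params = tt_param_count(in_modes, out_modes, ranks)

print("dense feature-to-hidden parameters:", dense_params)
print("tensor-train parameters:", tt_params)
```

Under these assumed sizes the factorized mapping stores far fewer parameters than the dense matrix; the actual saving in any given model depends on the chosen mode factorization and TT-ranks.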
