Adaptive Video Highlight Detection by Learning from User History

Recently, there is an increasing interest in highlight detection research where the goal is to create a short duration video from a longer video by extracting its interesting moments. However, most existing methods ignore the fact that the definition of video highlight is highly subjective. Different users may have different preferences of highlight for the same input video. In this paper, we propose a simple yet effective framework that learns to adapt highlight detection to a user by exploiting the user's history in the form of highlights that the user has previously created. Our framework consists of two sub-networks: a fully temporal convolutional highlight detection network $H$ that predicts highlight for an input video and a history encoder network $M$ for user history. We introduce a newly designed temporal-adaptive instance normalization (T-AIN) layer to $H$ where the two sub-networks interact with each other. T-AIN has affine parameters that are predicted from $M$ based on the user history and is responsible for the user-adaptive signal to $H$. Extensive experiments on a large-scale dataset show that our framework can make more accurate and user-specific highlight predictions.

[1]  Eric P. Xing,et al.  Reconstructing Storyline Graphs for Image Recommendation from Web Community Photos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Yannis Kalantidis,et al.  Less Is More: Learning Highlight Detection From Video Duration , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Andrea Vedaldi,et al.  Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Kaiyang Zhou,et al.  Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward , 2017, AAAI.

[6]  Chih-Jen Lin,et al.  Large-Scale Video Summarization Using Web-Image Priors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Mubarak Shah,et al.  Query-Focused Extractive Video Summarization , 2016, ECCV.

[8]  Ke Zhang,et al.  Summary Transfer: Exemplar-Based Subset Selection for Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ali Farhadi,et al.  Ranking Domain-Specific Highlights by Analyzing Edited Videos , 2014, ECCV.

[10]  Michael Gygli,et al.  PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation , 2018, ACM Multimedia.

[11]  Yongdong Zhang,et al.  Multi-task deep visual-semantic embedding for video thumbnail selection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Amit K. Roy-Chowdhury,et al.  Collaborative Summarization of Topic-Related Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[15]  Amit K. Roy-Chowdhury,et al.  Weakly Supervised Summarization of Web Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Radomír Mech,et al.  Sequence-to-Segment Networks for Segment Detection , 2018, NeurIPS.

[17]  Luc Van Gool,et al.  Video summarization by learning submodular mixtures of objectives , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Michael Gygli,et al.  Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks , 2017, 2018 International Conference on Content-Based Multimedia Indexing (CBMI).

[19]  Yang Wang,et al.  Video Summarization by Learning From Unpaired Data , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Taesung Park,et al.  Semantic Image Synthesis With Spatially-Adaptive Normalization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ting Chen,et al.  On Self Modulation for Generative Adversarial Networks , 2018, ICLR.

[22]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  John Zimmerman,et al.  Framework for personalized multimedia summarization , 2005, MIR '05.

[24]  Fumiko Satoh,et al.  Learning personalized video highlights from detailed MPEG-7 metadata , 2002, Proceedings. International Conference on Image Processing.

[25]  Tao Mei,et al.  Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Noboru Babaguchi,et al.  User and Device Adaptation for Sports Video Content , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[27]  Ke Zhang,et al.  Retrospective Encoders for Video Summarization , 2018, ECCV.

[28]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[29]  Mohammad Soleymani The Quest for Visual Interest , 2015, ACM Multimedia.

[30]  Serge J. Belongie,et al.  Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Noboru Babaguchi,et al.  Learning Personal Preference From Viewer's Operations for Browsing and Its Application to Baseball Video Retrieval and Summarization , 2007, IEEE Transactions on Multimedia.

[32]  Hugo Larochelle,et al.  Modulating early visual processing by language , 2017, NIPS.

[33]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[34]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Changsheng Xu,et al.  Video Highlight Detection via Deep Ranking Modeling , 2017, PSIVT.

[36]  Michael Lam,et al.  Unsupervised Video Summarization with Adversarial LSTM Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Ba Tu Truong,et al.  Video abstraction: A systematic review and classification , 2007, TOMCCAP.

[38]  Xuelong Li,et al.  Hierarchical Recurrent Neural Network for Video Summarization , 2017, ACM Multimedia.

[39]  Yang Wang,et al.  Video Summarization Using Fully Convolutional Sequence Networks , 2018, ECCV.

[40]  Kristen Grauman,et al.  Story-Driven Summarization for Egocentric Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Ke Zhang,et al.  Video Summarization with Long Short-Term Memory , 2016, ECCV.

[42]  Luc Van Gool,et al.  Query-adaptive Video Summarization via Quality-aware Relevance Estimation , 2017, ACM Multimedia.

[43]  Gunhee Kim,et al.  A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video , 2018, AAAI.

[44]  Larry S. Davis,et al.  Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior , 2018, ECCV.

[45]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Michael Kampffmeyer,et al.  Query-Conditioned Three-Player Adversarial Network for Video Summarization , 2018, BMVC.

[47]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[48]  Chao Dong,et al.  Recovering Realistic Texture in Image Super-Resolution by Deep Spatial Feature Transform , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Chong-Wah Ngo,et al.  Automatic video summarization by graph modeling , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[50]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[51]  Yale Song,et al.  TVSum: Summarizing web videos using titles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Jaakko Lehtinen,et al.  Few-Shot Unsupervised Image-to-Image Translation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Yale Song,et al.  Video2GIF: Automatic Generation of Animated GIFs from Video , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).