Video Summarization by Learning Relationships between Action and Scene

We propose a novel deep architecture for video summarization in untrimmed videos that simultaneously recognizes the action and scene classes of every video segment. Our networks accomplish this through a multi-task fusion approach based on two types of attention modules that explore the semantic correlations between action and scene in the videos. The proposed networks consist of feature embedding networks and attention inference networks that stochastically leverage the inferred action and scene feature representations. Additionally, we design a new center loss function that learns the feature representations by minimizing intra-class variations and maximizing inter-class variations. Our model achieves a score of 0.8409 for summarization and an accuracy of 0.7294 for action and scene recognition on the test set of the CoVieW'19 dataset, ranking 3rd.
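The abstract does not give the form of the proposed center loss, so the following is only a minimal sketch of the general idea: an intra-class term that pulls each feature toward its class center, plus an inter-class term that pushes class centers apart. The function name `center_style_loss`, the hinge-style `margin`, and the `weight` balancing the two terms are all assumptions made for illustration and are not taken from the paper.

```python
import math


def center_style_loss(features, labels, centers, margin=1.0, weight=0.5):
    """Illustrative center-style loss (hypothetical, not the paper's formula).

    features: list of feature vectors (lists of floats)
    labels:   list of class ids, one per feature vector
    centers:  dict mapping class id -> center vector
    """
    # Intra-class term: half the mean squared distance of each
    # feature to the center of its own class.
    intra = 0.0
    for x, y in zip(features, labels):
        intra += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    intra /= 2 * len(features)

    # Inter-class term (assumed form): squared hinge penalty on every
    # pair of class centers that lie closer together than `margin`.
    classes = sorted(centers)
    inter, pairs = 0.0, 0
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            dist = math.dist(centers[a], centers[b])
            inter += max(0.0, margin - dist) ** 2
            pairs += 1
    inter = inter / pairs if pairs else 0.0

    return intra + weight * inter
```

Under these assumptions, the loss is zero when every feature sits on its class center and all centers are at least `margin` apart; it grows as features drift from their centers or as centers move closer together.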
