Self-Supervised Learning to Detect Key Frames in Videos

Detecting key frames in videos is a common problem in many applications, such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require key frames to be manually labelled in a large corpus of training data. Labelling requires human annotators from different backgrounds to annotate key frames in videos, which is not only expensive and time-consuming but also prone to subjective errors and inconsistencies between the labelers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture that reliably annotates key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. Extensive experiments on the UCF101 human action dataset and the VSUMM video summarization dataset demonstrate the effectiveness of our proposed method.
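As a rough illustration of what detecting "frames that are unique" from appearance and motion features could look like, the sketch below scores each frame by how far its fused feature lies from the video's mean feature and keeps the top-scoring frames. It is not the proposed ConvNet or annotation architecture: the function names, the concatenation-based fusion, the distance-to-mean score and the top-k selection are all assumptions made for this example, and the per-frame features are taken as given.

```python
import numpy as np


def _l2_normalise(features):
    """Scale every row (one frame) to unit L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / (norms + 1e-8)


def uniqueness_scores(appearance, motion):
    """Score frames by the distance of their fused feature from the video mean.

    appearance, motion: arrays of shape (num_frames, dim), one row per frame.
    Fusing by concatenation is an assumption of this sketch, not the
    fusion used in the paper.
    """
    # Normalise each stream separately so neither dominates through scale.
    fused = np.concatenate(
        [_l2_normalise(appearance), _l2_normalise(motion)], axis=1
    )
    centre = fused.mean(axis=0)
    return np.linalg.norm(fused - centre, axis=1)


def select_key_frames(scores, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    k = min(k, len(scores))
    top = np.argsort(scores)[::-1][:k]
    return sorted(int(i) for i in top)


# Toy video: ten identical frames, except frame 4 (appearance) and frame 7 (motion).
appearance = np.ones((10, 4))
appearance[4] = [1.0, -1.0, 1.0, -1.0]
motion = np.ones((10, 3))
motion[7] = [-1.0, -1.0, -1.0]

key_frames = select_key_frames(uniqueness_scores(appearance, motion), k=2)
```

In this toy video the two frames that differ from the rest receive the largest scores. In a learned setting, scores of this kind would come from the trained network rather than from a fixed distance.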
