Deep Keyframe Detection in Human Action Videos

Detecting representative frames in videos of human actions is challenging because both the pose of the person performing the action and the background vary across frames. This paper addresses the problem by formulating key frame detection as finding the video frames that contribute most to differentiating the underlying action category from all other categories. To this end, we introduce a deep two-stream ConvNet for key frame detection in videos that learns to directly predict the location of key frames. Our key idea is to automatically generate labeled training data for the CNN using a supervised linear discriminant method. While the training data are generated from many different human action videos, the trained CNN can predict the importance of frames in a single video. We specify a new ConvNet framework consisting of a summarizer and a discriminator. The summarizer is a two-stream ConvNet that first captures the appearance and motion features of video frames and then encodes these features into a video representation. The discriminator is a fitting function that distinguishes the key frames from the other frames in a video. We conduct experiments on the challenging human action dataset UCF101 and show that our method detects key frames with high accuracy.
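
The abstract does not give implementation details, so the following is only a minimal sketch of the summarizer/discriminator idea it describes, written as an assumed PyTorch implementation. The ResNet-18 backbones, 256-dimensional per-stream features, 10-channel optical-flow stacks, and the small regression head are all illustrative assumptions, not the paper's actual configuration.

# Sketch only: a two-stream summarizer (appearance + motion) feeding a
# discriminator that scores each frame's importance. Architectural details
# are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models


class TwoStreamSummarizer(nn.Module):
    """Encodes per-frame appearance (RGB) and motion (stacked optical flow)."""

    def __init__(self, flow_channels=10, feat_dim=256):
        super().__init__()
        # Appearance stream: CNN on RGB frames (backbone choice assumed).
        self.rgb_stream = models.resnet18(weights=None)
        self.rgb_stream.fc = nn.Linear(self.rgb_stream.fc.in_features, feat_dim)
        # Motion stream: same backbone, first conv adapted to flow stacks (assumed).
        self.flow_stream = models.resnet18(weights=None)
        self.flow_stream.conv1 = nn.Conv2d(flow_channels, 64, 7, 2, 3, bias=False)
        self.flow_stream.fc = nn.Linear(self.flow_stream.fc.in_features, feat_dim)

    def forward(self, rgb, flow):
        # rgb:  (T, 3, H, W)  one RGB image per sampled frame
        # flow: (T, C, H, W)  stacked optical-flow fields per frame
        return torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=1)


class Discriminator(nn.Module):
    """Regresses a per-frame importance score from the fused features."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))

    def forward(self, features):
        return self.score(features).squeeze(-1)  # (T,) importance per frame


# Usage: score all frames of one video and keep the highest-scoring ones.
summarizer, discriminator = TwoStreamSummarizer(), Discriminator()
rgb = torch.randn(16, 3, 224, 224)    # 16 sampled frames
flow = torch.randn(16, 10, 224, 224)  # 5 flow pairs (x, y) per frame
scores = discriminator(summarizer(rgb, flow))
key_frames = scores.topk(3).indices   # indices of candidate key frames

In this reading, key frame detection reduces to ranking frames by the discriminator's score, which matches the abstract's description of a fitting function that separates key frames from the remaining frames of a video.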
