An Unsupervised Feature Learning and Clustering Method for Key Frame Extraction on Human Action Recognition

Recognizing human actions in video is a very active research topic. The growing variety of human action datasets, with different video lengths and different performers, makes human action recognition a very difficult problem. A majority of researchers address the problem by extracting key frames from the videos, and most papers use feature clustering methods to do so. On one hand, the large variety of visual content in videos makes handcrafted features insufficiently effective, since no fixed descriptor can describe all video cases. On the other hand, traditional clustering algorithms are easily influenced by the choice of initial cluster centers. This paper proposes an unsupervised feature learning and clustering method for key frame extraction that can be used for human action recognition. A stacked auto-encoder (SAE) is trained on videos of 10 different human actions and used as a feature extractor to learn features representing human actions. The Affinity Propagation clustering algorithm is then used to select key frames from video sequences. Experiments on a variety of videos demonstrate that the method can effectively summarize video shots covering different human actions.
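The clustering step can be illustrated with a minimal sketch of plain Affinity Propagation over per-frame feature vectors. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `neg_sq_dist` and `select_key_frames` are hypothetical, the synthetic 3-dimensional vectors merely stand in for features that an SAE encoder would produce, and the similarity (negative squared Euclidean distance), the median preference, and the damping value are common defaults rather than the authors' settings. Unlike k-means-style methods, no initial cluster centers are chosen; the frames that emerge as exemplars are taken as key frames.

```python
import random
from statistics import median


def neg_sq_dist(u, v):
    """Similarity between two frames: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def select_key_frames(features, damping=0.8, iterations=300):
    """Return (key_frame_indices, labels) for a list of per-frame feature vectors."""
    n = len(features)
    S = [[neg_sq_dist(features[i], features[k]) for k in range(n)] for i in range(n)]
    # Assumed preference (self-similarity): median of the off-diagonal similarities.
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        S[i][i] = pref

    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            top = scores[best]
            second = max(v for k, v in enumerate(scores) if k != best)
            for k in range(n):
                new = S[i][k] - (second if k == best else top)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive support from all other frames
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    # A frame is an exemplar (key frame) when r(k,k) + a(k,k) > 0.
    key_frames = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not key_frames:  # no exemplar emerged; the caller may adjust damping
        return [], [-1] * n
    labels = [i if i in key_frames else max(key_frames, key=lambda k: S[i][k])
              for i in range(n)]
    return key_frames, labels


# Synthetic stand-in for learned features: three "action phases" of 10 frames each.
random.seed(0)
centers = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
frames = [[c + random.gauss(0, 0.5) for c in center]
          for center in centers for _ in range(10)]

key_frames, labels = select_key_frames(frames)
print("key frame indices:", key_frames)
```

In this sketch `labels[i]` holds the index of the key frame that represents frame `i`, so the number of key frames is decided by the data and the preference value rather than fixed in advance. A more negative preference generally yields fewer key frames.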
