Adaptive Temporal Segmentation for Action Recognition

Learning deep representations have been widely used in action recognition task. However, the features extracted by deep convolutional neural networks (CNNs) have many redundant information. This paper aims to discover the relevance between temporal features extracted by CNNs. Different fromTemporal Segment Networks (TSN) to randomly select video clips. Based on the matrix-based Rényi’s α-entropy, we estimate the mutual information between temporal domain features. Through our experiments, we propose an adaptive temporal segmentation scheme to represent the entire videos. We also combine the features of RGB and optical flow frames extracted by 3D ConvNets to verify the complementary information between them. We show that the proposed approach achieves 94.4 and 72.8 percent accuracy, in the UCF- 101 and HMDB-51 datasets.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[4]  Ling Shao,et al.  Action Recognition From Arbitrary Views Using Transferable Dictionary Learning , 2018, IEEE Transactions on Image Processing.

[5]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[7]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[9]  Junji Yamato,et al.  Recognizing human action in time-sequential images using hidden Markov model , 1992, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Yu Qiao,et al.  Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos , 2018, IEEE Transactions on Image Processing.

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Jun Wang,et al.  Feature Selection by Maximizing Independent Classification Information , 2017, IEEE Transactions on Knowledge and Data Engineering.

[14]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[15]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[16]  Wenjun Zeng,et al.  Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection , 2018, IEEE Transactions on Image Processing.

[17]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[18]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[20]  Jose C. Principe,et al.  Measures of Entropy From Data Using Infinitely Divisible Kernels , 2012, IEEE Transactions on Information Theory.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Chunfeng Yuan,et al.  Multi-task Sparse Learning with Beta Process Prior for Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[24]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[25]  Mubarak Shah,et al.  Learning a Deep Model for Human Action Recognition from Novel Viewpoints , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[28]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[29]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[30]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[31]  Ling Shao,et al.  Arbitrary view action recognition via transfer dictionary learning on synthetic training data , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[32]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  Ling Shao,et al.  Action Recognition Using 3D Histograms of Texture and A Multi-Class Boosting Classifier , 2017, IEEE Transactions on Image Processing.

[34]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[36]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[37]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Mubarak Shah,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.