Early Action Recognition With Category Exclusion Using Policy-Based Reinforcement Learning

The goal of early action recognition is to predict action label when the sequence is partially observed. The existing methods treat the early action recognition task as sequential classification problems on different observation ratios of an action sequence. Since these models are trained by differentiating positive category from all negative classes, the diverse information of different negative categories is ignored, which we believe can be collected to help improve the recognition performance. In this paper, we step towards to a new direction by introducing category exclusion to early action recognition. We model the exclusion as a mask operation on the classification probability output of a pre-trained early action recognition classifier. Specifically, we use policy-based reinforcement learning to train an agent. The agent generates a series of binary masks to exclude interfering negative categories during action execution and hence help improve the recognition accuracy. The proposed method is evaluated on three benchmark recognition datasets, NTU-RGBD, First-Person Hand Action, as well as UCF-101. The proposed method enhances the recognition accuracy consistently over all different observation ratios on the three datasets, where the accuracy improvements on the early stages are especially significant.

[1]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[2]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[5]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[6]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[7]  Junsong Yuan,et al.  Learning Actionlet Ensemble for 3D Human Action Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[10]  Gregory D. Hager,et al.  Temporal Convolutional Networks: A Unified Approach to Action Segmentation , 2016, ECCV Workshops.

[11]  Gang Wang,et al.  Real-Time RGB-D Activity Prediction by Soft Regression , 2016, ECCV.

[12]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[13]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[15]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Shuo Wang,et al.  Dense Temporal Convolution Network for Sign Language Translation , 2019, IJCAI.

[17]  Sinisa Todorovic,et al.  Temporal Deformable Residual Networks for Action Segmentation in Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[19]  Ruzena Bajcsy,et al.  Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[20]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Meng Wang,et al.  Hierarchical LSTM for Sign Language Translation , 2018, AAAI.

[23]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[24]  Dacheng Tao,et al.  Slow Feature Analysis for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Junsong Yuan,et al.  Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Rama Chellappa,et al.  Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ser-Nam Lim,et al.  Adaptive RNN Tree for Large-Scale Human Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Bing Li,et al.  Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition , 2016, ECCV.

[30]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Junsong Yuan,et al.  Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[34]  Timo Aila,et al.  Pruning Convolutional Neural Networks for Resource Efficient Inference , 2016, ICLR.

[35]  Junsong Yuan,et al.  Recognizing Human Actions as the Evolution of Pose Estimation Maps , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Jian Yang,et al.  Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition , 2018, AAAI.

[37]  Sanghoon Lee,et al.  Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Wenjun Zeng,et al.  Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks , 2016, ECCV.

[39]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[40]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Junsong Yuan,et al.  Robust Part-Based Hand Gesture Recognition Using Kinect Sensor , 2013, IEEE Transactions on Multimedia.

[43]  Hongsong Wang,et al.  Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Shanxin Yuan,et al.  First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Gang Wang,et al.  Early Action Prediction by Soft Regression , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Bin Sun,et al.  Action Prediction From Videos via Memorizing Hard-to-Predict Samples , 2018, AAAI.

[47]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[48]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[49]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Xu Chen,et al.  Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Junsong Yuan,et al.  Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera , 2011, ACM Multimedia.

[53]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Yun Fu,et al.  Max-Margin Action Prediction Machine , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Xudong Jiang,et al.  Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition , 2018, ECCV.

[56]  Nadia Magnenat-Thalmann,et al.  AR in Hand: Egocentric Palm Pose Tracking and Gesture Recognition for Augmented Reality Applications , 2015, ACM Multimedia.

[57]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[58]  Gang Yu,et al.  Discriminative Orderlet Mining for Real-Time Recognition of Human-Object Interaction , 2014, ACCV.

[59]  Pichao Wang,et al.  Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks , 2016, ACM Multimedia.

[60]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[63]  Gang Wang,et al.  SSNet: Scale Selection Network for Online 3D Action Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[65]  Yun Fu,et al.  A Discriminative Model with Multiple Temporal Scales for Action Prediction , 2014, ECCV.

[66]  Guillermo Garcia-Hernando,et al.  Transition Forests: Learning Discriminative Temporal Transitions for Action Recognition and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).