ActionXPose: A Novel 2D Multi-view Pose-based Algorithm for Real-time Human Action Recognition

We present ActionXPose, a novel 2D pose-based algorithm for posture-level Human Action Recognition (HAR). The proposed approach exploits 2D human poses provided by OpenPose detector from RGB videos. ActionXPose aims to process poses data to be provided to a Long Short-Term Memory Neural Network and to a 1D Convolutional Neural Network, which solve the classification problem. ActionXPose is one of the first algorithms that exploits 2D human poses for HAR. The algorithm has real-time performance and it is robust to camera movings, subject proximity changes, viewpoint changes, subject appearance changes and provide high generalization degree. In fact, extensive simulations show that ActionXPose can be successfully trained using different datasets at once. State-of-the-art performance on popular datasets for posture-related HAR problems (i3DPost, KTH) are provided and results are compared with those obtained by other methods, including the selected ActionXPose baseline. Moreover, we also proposed two novel datasets called MPOSE and ISLD recorded in our Intelligent Sensing Lab, to show ActionXPose generalization performance.

[1]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[3]  Teuvo Kohonen,et al.  Self-Organizing Maps, Third Edition , 2001, Springer Series in Information Sciences.

[4]  A. Savakis,et al.  Multi-view action classification using sparse representations on Motion History Images , 2012, 2012 Western New York Image Processing Workshop.

[5]  Anupam Agrawal,et al.  A survey on activity recognition and behavior understanding in video surveillance , 2012, The Visual Computer.

[6]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[7]  Raquel Almeida,et al.  Exploring quantization error to improve human action classification , 2017, 2017 International Joint Conference on Neural Networks (IJCNN).

[8]  Larry S. Davis,et al.  Recognizing actions by shape-motion prototype trees , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[9]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[10]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[11]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[12]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[13]  Lihong Zheng,et al.  A Survey on Human Action Recognition Using Depth Sensors , 2015, 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[14]  Ioannis A. Kakadiaris,et al.  Matching mixtures of curves for human action recognition , 2014, Comput. Vis. Image Underst..

[15]  Alexandros Iosifidis,et al.  Multi-view action recognition based on action volumes, fuzzy distances and cluster discriminant analysis , 2013, Signal Process..

[16]  Sergio Escalera,et al.  RGB-D-based Human Motion Recognition with Deep Learning: A Survey , 2017, Comput. Vis. Image Underst..

[17]  Ling Shao,et al.  Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach , 2016, IEEE Transactions on Cybernetics.

[18]  Marc Van Droogenbroeck,et al.  ViBe: A Universal Background Subtraction Algorithm for Video Sequences , 2011, IEEE Transactions on Image Processing.

[19]  Dmitri Roussinov,et al.  A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation , 1998 .

[20]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[22]  Otmar Loffeld,et al.  Depth Camera Technology Comparison and Performance Evaluation , 2012, ICPRAM.

[23]  Jitendra Malik,et al.  Shape matching and object recognition using low distortion correspondences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[24]  Jorge Martinez-Carballido,et al.  Real Time Human Action Recognition Using Full and Ultra High Definition Video , 2015, 2015 International Conference on Computational Science and Computational Intelligence (CSCI).

[25]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[27]  Hakil Kim,et al.  Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN , 2017, ArXiv.

[28]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Suman Saha,et al.  Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Hilke Kieritz,et al.  Hierarchical Hough forests for view-independent action recognition , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[31]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[32]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[33]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Jae-Bok Song,et al.  Data augmentation using synthesized images for object detection , 2017, 2017 17th International Conference on Control, Automation and Systems (ICCAS).

[35]  Yu Zhang,et al.  Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[36]  Ioannis Pitas,et al.  The i3DPost Multi-View and 3D Human Action/Interaction Database , 2009, 2009 Conference for Visual Media Production.

[37]  Tsuhan Chen,et al.  Spatio-Temporal Phrases for Activity Recognition , 2012, ECCV.

[38]  Bernadette Dorizzi,et al.  Two-layer discriminative model for human activity recognition , 2016, IET Comput. Vis..

[39]  Sergio A. Velastin,et al.  3D-Hog Embedding Frameworks for Single and Multi-Viewpoints Action Recognition Based on Human Silhouettes , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Max Kuhn,et al.  Applied Predictive Modeling , 2013 .

[41]  Mehrtash Tafazzoli Harandi,et al.  Going deeper into action recognition: A survey , 2016, Image Vis. Comput..

[42]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[43]  Stefano Soatto,et al.  Tracklet Descriptors for Action Modeling and Video Analysis , 2010, ECCV.

[44]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[46]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[47]  Shuang Wang,et al.  A Review on Human Activity Recognition Using Vision-Based Method , 2017, Journal of healthcare engineering.

[48]  Guo-Jun Qi,et al.  Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[49]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[51]  Houshang Darabi,et al.  Multivariate LSTM-FCNs for Time Series Classification , 2018, Neural Networks.

[52]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[53]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.