One-Shot SADI-EPE: A Visual Framework of Event Progress Estimation

In many practical engineering applications, particularly for an untrimmed video sequence containing an event composed of a series of actions, it is important to know how many of those actions have been completed. In this paper, we term this task visual event progress estimation (EPE). Little research in the community has addressed this problem. To tackle it, we present a visual human action analysis-based framework: one-shot simultaneous action detection and identification (SADI)-EPE. Visual EPE is modeled as an online one-shot learning problem, and our framework is built on a sliding window together with an attention-based bag of key poses. Unlike most action analysis methods, which rely on large amounts of training data for predefined classes, our method can perform SADI for any event given a single sample of that event, which makes it feasible for practical applications. Moreover, our algorithm delivers not only SADI but also an estimate of the event's progress. In terms of methodology, key poses are defined by an invariant pose descriptor computed from skeletal and silhouette data. To extract representative and discriminative poses from a single training sample, we present a new bidirectional $k$NN-based attention-weighted key pose selection method, which filters out unrelated actions and models the varying importance of different key poses. In addition, an attention-based multi-modal fusion scheme, which addresses the difficulty of high-dimensional features and few training samples, is proposed to improve the performance of our algorithm. Finally, we propose an evaluation criterion for the estimation problem. Extensive experimental results demonstrate the efficacy of the proposed framework.
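To make the pipeline concrete, below is a minimal NumPy sketch of how a one-shot bag-of-key-poses progress estimator along these lines might look. The clustering step, the exact bidirectional $k$NN scoring rule, and all function names (`select_key_poses`, `estimate_progress`) are our own illustrative assumptions, not the authors' implementation; pose descriptors are assumed to be fixed-length vectors, one per frame.

```python
# Illustrative sketch only: a simple stand-in for the one-shot
# bag-of-key-poses idea described in the abstract. The weighting
# rule and function names are assumptions, not the paper's code.
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (n x d) and rows of B (m x d)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def select_key_poses(frames, n_clusters=10, k=3, seed=0):
    """Cluster per-frame pose descriptors into candidate key poses, then
    weight each key pose by a bidirectional kNN consistency score: a key
    pose counts as representative when its k nearest frames also rank it
    among their k nearest key poses (assumed scoring rule)."""
    rng = np.random.default_rng(seed)
    # Lightweight k-means-style clustering; a few iterations suffice here.
    centers = frames[rng.choice(len(frames), n_clusters, replace=False)]
    for _ in range(10):
        assign = pairwise_dist(frames, centers).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = frames[assign == c].mean(axis=0)
    # Bidirectional kNN attention weights.
    d = pairwise_dist(centers, frames)       # n_clusters x n_frames
    knn_f = np.argsort(d, axis=1)[:, :k]     # key pose -> nearest frames
    knn_c = np.argsort(d, axis=0)[:k, :]     # frame -> nearest key poses
    weights = np.array([
        np.mean([c in knn_c[:, f] for f in knn_f[c]])
        for c in range(n_clusters)
    ])
    weights = weights / (weights.sum() + 1e-12)
    return centers, weights

def estimate_progress(window, centers, weights, template_hist):
    """Compare an attention-weighted key-pose histogram of the current
    sliding window against the template event's histogram; the matched
    fraction serves as a crude progress score (illustrative)."""
    assign = pairwise_dist(window, centers).argmin(axis=1)
    hist = np.bincount(assign, minlength=len(centers)) * weights
    matched = np.minimum(hist, template_hist).sum()
    return matched / (template_hist.sum() + 1e-12)
```

In use, one would build `template_hist` once from the single training sample with the same weighted-histogram construction, then slide a window over the incoming stream and report `estimate_progress` online as frames arrive.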
