Time-ordered spatial-temporal interest points for human action classification

Human action classification, which is vital for content-based video retrieval and human-machine interaction, remains difficult when different actions appear similar. Previous works typically detect spatial-temporal interest points (STIPs) from action sequences and then adopt the bag-of-visual-words (BoVW) model to describe actions as numerical statistics of STIPs. Despite its robustness, the BoVW model ignores the spatial-temporal layout of STIPs, leading to misclassification among different types of actions that share similar STIP statistics. Motivated by this, a time-ordered feature is designed to describe the temporal distribution of STIPs, providing structural information complementary to the traditional BoVW model. Moreover, a temporal refinement method is used to eliminate intra-class variations among time-ordered features caused by performers' individual habits. A time-ordered BoVW model is then built to represent actions, encoding both the numerical statistics and the temporal distribution of STIPs. Extensive experiments on three challenging datasets, i.e., KTH, Rochester, and UT-Interaction, validate the effectiveness of our method in distinguishing similar actions.

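To make the representation concrete, the following is a minimal Python sketch of one plausible time-ordered BoVW encoding: a global visual-word histogram is concatenated with per-segment histograms computed over uniform temporal segments, so two videos with identical overall word counts but different temporal orderings of STIPs yield different features. The function name `time_ordered_bovw`, the uniform segmentation, and the nearest-neighbor word assignment are illustrative assumptions, not the paper's exact formulation (which additionally applies a temporal refinement step).

```python
import numpy as np

def time_ordered_bovw(stip_descriptors, stip_times, codebook, num_segments=3):
    """Illustrative sketch of a time-ordered BoVW feature for one video.

    stip_descriptors: (N, D) array of STIP descriptors (assumes N >= 1)
    stip_times:       (N,) array of frame indices where each STIP occurs
    codebook:         (K, D) array of visual-word centers (e.g., from k-means)
    """
    # Assign each STIP to its nearest visual word (hard assignment).
    dists = np.linalg.norm(
        stip_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    K = codebook.shape[0]

    # Global BoVW histogram: the numerical statistics of STIPs.
    global_hist = np.bincount(words, minlength=K).astype(float)

    # Temporal distribution: split the sequence into ordered segments
    # and build one word histogram per segment.
    t_min, t_max = stip_times.min(), stip_times.max()
    edges = np.linspace(t_min, t_max + 1e-6, num_segments + 1)
    segment_hists = []
    for s in range(num_segments):
        mask = (stip_times >= edges[s]) & (stip_times < edges[s + 1])
        segment_hists.append(
            np.bincount(words[mask], minlength=K).astype(float))

    # Concatenate and L1-normalize: the result encodes both the counts
    # and the time ordering of STIPs.
    feat = np.concatenate([global_hist] + segment_hists)
    return feat / max(feat.sum(), 1e-12)
```

In this sketch, swapping the first and last segments of a video changes the concatenated feature even though the global histogram is unchanged, which is precisely the structural information a plain BoVW representation discards.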