Bag-of-words with aggregated temporal pair-wise word co-occurrence for human action recognition

Classic BoW (bag-of-words) just counts words and lacks spatio-temporal constraints. The proposed extension of BoW (t-BoW) considers aggregated temporal word co-occurrences. t-BoW is conceptually simpler than other existing BoW extensions. The BoW pipeline is altered minimally and no additional learning schemes are required. t-BoW is effective and outperforms plain BoW and other extensions.

The bag-of-words (BoW) representation has been used successfully for human action recognition from videos. However, one limitation of the standard BoW is that it ignores spatial and temporal relationships between the visual words. Although several approaches have been proposed to deal with this issue, we propose an extension which is arguably simpler yet quite effective. The proposed representation, t-BoW, captures only temporal relationships between pairs of words, in an aggregated way, by counting co-occurrences at several temporal differences. Unlike other approaches, neither spatial nor hierarchical information is accounted for explicitly, and no significant change is required in the quantization or classification procedures. Performance improvements over the traditional BoW and other BoW extensions are observed experimentally on the KTH, ADL, Keck, and HMDB51 action/gesture datasets.
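The idea of counting pair-wise word co-occurrences at several temporal differences can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of offsets, the use of frame indices as the time unit, and the concatenation of the plain histogram with the flattened co-occurrence counts are all assumptions made for illustration, and no normalization is applied.

```python
import numpy as np


def bow_histogram(words, vocab_size):
    """Plain BoW: count how often each visual word occurs in the video."""
    return np.bincount(np.asarray(words, dtype=int), minlength=vocab_size).astype(float)


def temporal_cooccurrence(frames, words, vocab_size, offsets):
    """Aggregated temporal co-occurrence counts (illustrative sketch).

    frames[i] is the frame index of the i-th local feature and words[i]
    its quantized visual word. For each positive offset d, entry [a, b]
    counts the pairs with word a at some frame t and word b at frame t + d,
    summed over all t. Spatial positions are deliberately ignored.
    """
    frames = np.asarray(frames, dtype=int)
    words = np.asarray(words, dtype=int)
    n_frames = int(frames.max()) + 1

    # per_frame[t, k] = number of features with word k in frame t
    per_frame = np.zeros((n_frames, vocab_size))
    np.add.at(per_frame, (frames, words), 1.0)

    cooc = np.zeros((len(offsets), vocab_size, vocab_size))
    for i, d in enumerate(offsets):
        if 0 < d < n_frames:
            # Sum of outer products between frame t and frame t + d.
            cooc[i] = per_frame[: n_frames - d].T @ per_frame[d:]
    return cooc


def t_bow_descriptor(frames, words, vocab_size, offsets=(1, 2)):
    """Hypothetical descriptor: BoW histogram followed by flattened co-occurrences."""
    hist = bow_histogram(words, vocab_size)
    cooc = temporal_cooccurrence(frames, words, vocab_size, offsets)
    return np.concatenate([hist, cooc.ravel()])


if __name__ == "__main__":
    # Toy video: five local features over three frames, vocabulary of 3 words.
    frames = [0, 0, 1, 2, 2]
    words = [0, 1, 1, 0, 2]
    print(t_bow_descriptor(frames, words, vocab_size=3))
```

Because the result is a fixed-length vector, it can be passed to the same classifier as a plain BoW histogram, which is consistent with the claim that the quantization and classification stages need no significant change. The descriptor grows with the square of the vocabulary size times the number of offsets, so a practical variant would likely need a small vocabulary or some form of dimensionality reduction.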
