Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

We present data-driven techniques for augmenting Bag-of-Words (BoW) models that allow more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach addresses a key limitation of standard BoW representations: they fail to capture the temporal and causal information inherent in activity streams. In addition, we propose randomly sampled regular expressions as a means of discovering and encoding patterns in activities. We demonstrate the effectiveness of our approach through experimental evaluations in which we successfully recognize activities and detect anomalies in four complex datasets.
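To make the idea of augmenting BoW with randomly sampled regular expressions concrete, the following is a minimal sketch, not the authors' implementation: it assumes activities have already been quantized into a string of codeword symbols, and the alphabet, sampling procedure, and function names are illustrative assumptions. Match counts of the sampled expressions are simply appended to the per-symbol BoW histogram as extra features.

```python
# Minimal sketch (assumptions throughout): augment a Bag-of-Words histogram with
# match counts of randomly sampled regular expressions over a quantized
# event-symbol string. The alphabet, expression length, and pool size are
# illustrative choices, not values from the paper.
import random
import re

ALPHABET = "abcdefgh"  # assumed: each letter stands for one quantized codeword/event label


def sample_regex(rng, max_len=4):
    """Randomly compose a small regular expression from symbols, wildcards, and gaps."""
    tokens = list(ALPHABET) + [".", ".*"]
    parts = [rng.choice(tokens) for _ in range(rng.randint(2, max_len))]
    return re.compile("".join(parts))


def bow_histogram(sequence):
    """Plain Bag-of-Words: per-symbol counts, temporal order discarded."""
    return [sequence.count(s) for s in ALPHABET]


def regex_features(sequence, patterns):
    """Augmented features: how often each sampled pattern matches the sequence."""
    return [len(p.findall(sequence)) for p in patterns]


if __name__ == "__main__":
    rng = random.Random(0)
    patterns = [sample_regex(rng) for _ in range(20)]  # pool of random expressions
    activity = "aabcaddbca"                            # toy quantized event stream
    features = bow_histogram(activity) + regex_features(activity, patterns)
    print(len(features), features[:12])
```

In this sketch, the regex match counts carry the ordering and gap structure that the raw histogram discards; a discriminative selection step (e.g., keeping only expressions whose counts separate activity classes) could then prune the random pool, in the spirit of the data-driven discovery described above.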
