A “string of feature graphs” model for recognition of complex activities in natural videos

Videos often contain activities that involve interactions between multiple actors, sometimes referred to as complex activities. Recognizing such activities requires modeling both the spatio-temporal relationships between the actors and the variability of each actor's individual motion. In this paper, we consider the problem of recognizing complex activities in a video given a query example. We propose a new feature model based on a string representation of the video that respects spatio-temporal ordering. The local collections of features (e.g., cuboids, STIP) that form the characters of the string are first matched using graph-based spectral techniques. Final recognition is obtained by matching the string representations of the query and test videos in a dynamic programming framework that accommodates variability in sampling rate and in the speed of activity execution. The method does not require tracking or recognition of body parts, can identify the region of interest in a cluttered scene, and gives reasonable performance even with a single query example. We evaluate our approach in an example-based video retrieval framework on two publicly available complex activity datasets and compare it against other methods that have studied this problem.
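The two-stage pipeline summarized above (spectral matching of local feature collections, followed by dynamic-programming alignment of the resulting strings) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`spectral_match_score`, `dtw_match`), the Gaussian affinity terms, and the score-to-cost transformation are assumptions chosen for clarity.

```python
import numpy as np

def spectral_match_score(feats_a, feats_b, sigma=1.0):
    """Score the match between two local feature collections (two string
    'characters') with a spectral relaxation in the spirit of Leordeanu-Hebert.
    Each collection is a tuple (positions: (k, 2) array, descriptors: (k, d) array).
    Candidate correspondences are all pairs (i, j); pairwise affinities reward
    agreement of relative geometry, so consistent assignments dominate the
    spectrum of the affinity matrix."""
    pos_a, desc_a = feats_a
    pos_b, desc_b = feats_b
    cands = [(i, j) for i in range(len(pos_a)) for j in range(len(pos_b))]
    n = len(cands)
    M = np.zeros((n, n))
    for p, (i, j) in enumerate(cands):
        # unary term on the diagonal: descriptor similarity of the candidate pair
        M[p, p] = np.exp(-np.linalg.norm(desc_a[i] - desc_b[j]) / sigma)
        for q, (k, l) in enumerate(cands):
            if p == q or i == k or j == l:
                continue  # enforce one-to-one consistency between candidates
            # pairwise term: preserve relative distances within each collection
            d_a = np.linalg.norm(pos_a[i] - pos_a[k])
            d_b = np.linalg.norm(pos_b[j] - pos_b[l])
            M[p, q] = np.exp(-abs(d_a - d_b) / sigma)
    vals, _ = np.linalg.eigh(M)      # M is symmetric and non-negative
    return float(vals[-1])           # leading eigenvalue as a scalar match score

def dtw_match(query_chars, test_chars):
    """Dynamic-programming (DTW-style) alignment of the query and test strings,
    tolerating differences in sampling rate and execution speed."""
    n, m = len(query_chars), len(test_chars)
    cost = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # higher spectral match score -> lower alignment cost
            cost[i, j] = 1.0 / (1.0 + spectral_match_score(query_chars[i],
                                                           test_chars[j]))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    return D[n, m]  # smaller is a better query-to-test alignment
```

In this sketch the leading eigenvalue of the affinity matrix is used as a scalar character-to-character similarity; the principal eigenvector could instead be discretized to recover explicit feature correspondences. A smaller DTW alignment cost indicates a better match between the query and test videos.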
