Expectation grammars: leveraging high-level expectations for activity recognition

Video-based recognition and prediction of a temporally extended activity can benefit from a detailed description of high-level expectations about the activity. Stochastic grammars allow for an efficient representation of such expectations and are well-suited for the specification of temporally well-ordered activities. In this paper, we extend stochastic grammars by adding event parameters, state checks, and sensitivity to an internal scene model. We present an implemented system that uses human-specified grammars to recognize a person performing the Towers of Hanoi task from a video sequence by analyzing object interaction events. Experimental results from several videos show robust recognition of the full task and its constituent sub-tasks even though no appearance models of the objects in the video are provided. These experiments include videos of the task performed with differently shaped objects and with distracting and extraneous interactions.
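To make the three extensions concrete, the following is a minimal illustrative sketch, not the system described in the paper. It assumes hypothetical event names (`remove`, `touch`, `add`), made-up rule probabilities, and a toy greedy matcher in place of a real stochastic parser. Each event carries a peg parameter, each recognized move is validated by a state check, and the check consults an internal scene model of the three pegs.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """Internal scene model: each peg holds a stack of disk sizes (top is last)."""
    pegs: dict = field(default_factory=lambda: {"A": [3, 2, 1], "B": [], "C": []})

    def can_move(self, src, dst):
        # State check: the source peg must hold a disk, and that disk must be
        # smaller than the disk currently on top of the destination peg.
        if not self.pegs[src]:
            return False
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def apply(self, src, dst):
        # Update the scene model after a move has been accepted.
        self.pegs[dst].append(self.pegs[src].pop())


# Hypothetical productions for the nonterminal Move, with assumed probabilities.
# The second alternative tolerates one extraneous "touch" event inside a move.
MOVE_RULES = [
    (0.9, ["remove", "add"]),
    (0.1, ["remove", "touch", "add"]),
]


def parse_moves(events, scene):
    """Greedily parse (name, peg) events as a sequence of Move productions.

    Returns the product of the probabilities of the rules used, or 0.0 if no
    production matches or a state check against the scene model fails.
    """
    prob, i = 1.0, 0
    while i < len(events):
        for p, symbols in MOVE_RULES:
            window = events[i:i + len(symbols)]
            if [name for name, _ in window] == symbols:
                # Event parameters: first event names the source peg,
                # last event names the destination peg.
                src, dst = window[0][1], window[-1][1]
                if not scene.can_move(src, dst):
                    return 0.0
                scene.apply(src, dst)
                prob *= p
                i += len(symbols)
                break
        else:
            return 0.0  # no production matches at this position
    return prob
```

In this sketch, a sequence such as `[("remove", "A"), ("add", "C")]` is accepted and updates the scene, whereas placing a larger disk onto a smaller one is rejected by the state check regardless of how well the event names match the grammar.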
