An Extended Grammar System for Learning and Recognizing Complex Visual Events

Grammar-based approaches to visual event recognition face two major limitations that hinder their real-world application. One is that event rules are predefined by domain experts, which incurs substantial manual cost. The other is that commonly used grammars can only handle sequential relations between subevents, which makes them inadequate for recognizing more complex events that involve parallel subevents. To address these problems, we propose an extended grammar approach to modeling and recognizing complex visual events. First, motion trajectories, used as the raw features, are transformed into a set of basic motion patterns of single moving objects, which serve as the primitives (terminals) of the grammar system. Then, a Minimum Description Length (MDL)-based rule induction algorithm is applied to discover the hidden temporal structures in the primitive stream, where a Stochastic Context-Free Grammar (SCFG) is extended with Allen's temporal logic to model complex temporal relations between subevents. Finally, a Multithread Parsing (MTP) algorithm is adopted to recognize complex events of interest in a given primitive stream, and a Viterbi-like error recovery strategy is proposed to handle large-scale errors such as insertions and deletions. Extensive experiments on gymnastic exercises, traffic light events, and multi-agent interactions validate the effectiveness of the proposed approach.
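The abstract describes MDL-based rule induction only at a high level. As a rough illustration of the underlying idea, the Python sketch below greedily replaces the most frequent adjacent pair of symbols with a new nonterminal and keeps the new rule only if a two-part description length, L(grammar) + L(data | grammar), decreases. The uniform per-symbol code, the pair-replacement search, and the names `description_length`, `replace_pair`, and `induce_rules` are illustrative assumptions, not the authors' algorithm; the sketch also ignores rule probabilities and the temporal relations that the extended SCFG adds.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical rule table: nonterminal -> (left symbol, right symbol).
Rules = Dict[str, Tuple[str, str]]


def description_length(seq: List[str], rules: Rules) -> float:
    """Toy two-part MDL code in bits: L(grammar) + L(data | grammar).

    Assumption: every symbol costs log2(|V|) bits, where V holds all
    terminals and nonterminals; each binary rule costs 3 symbols
    (its left-hand side plus two right-hand-side symbols).
    """
    vocab = set(seq) | set(rules)
    for rhs in rules.values():
        vocab.update(rhs)
    # Use at least 1 bit per symbol so a one-symbol vocabulary is not free.
    bits = math.log2(max(len(vocab), 2))
    grammar_cost = 3 * len(rules) * bits
    data_cost = len(seq) * bits
    return grammar_cost + data_cost


def replace_pair(seq: List[str], pair: Tuple[str, str], name: str) -> List[str]:
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def induce_rules(seq: List[str]) -> Tuple[List[str], Rules]:
    """Greedily add pair rules while they shorten the total description."""
    rules: Rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # a pair seen once cannot pay for its own rule
            break
        name = f"N{len(rules)}"
        cand_seq = replace_pair(seq, pair, name)
        cand_rules = {**rules, name: pair}
        if description_length(cand_seq, cand_rules) >= description_length(seq, rules):
            break  # the new rule does not compress the stream: stop
        seq, rules = cand_seq, cand_rules
    return seq, rules
```

On a hypothetical primitive stream `list("abc" * 8)`, this sketch learns `N0 → a b` and `N1 → N0 c`; merging `N1 N1` into a further rule no longer shortens the total code, so the search stops there. The MDL criterion is what prevents the grammar from simply memorizing the whole stream.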

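The abstract relies on Allen's interval relations to express parallel as well as sequential subevents. A minimal sketch of classifying two subevent intervals into one of Allen's 13 relations is shown below. It assumes each subevent is a time interval `(start, end)` with `start < end`; the function name `allen_relation` and the relation labels are illustrative choices, not the paper's notation.

```python
from typing import Tuple

# Assumption: a subevent is a time interval (start, end) with start < end.
Interval = Tuple[float, float]


def allen_relation(a: Interval, b: Interval) -> str:
    """Return which of Allen's 13 interval relations holds from a to b."""
    (s1, e1), (s2, e2) = a, b
    # Disjoint or touching intervals: the purely sequential relations.
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    # From here on, the intervals share a stretch of time (parallel relations).
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    if s1 < s2:
        return "overlaps"  # s1 < s2 < e1 < e2
    return "overlapped-by"  # s2 < s1 < e2 < e1
```

For instance, two hypothetical subevents spanning frames 0 to 5 and 3 to 8 are related by `overlaps`, a parallel relation that a grammar limited to sequential concatenation (effectively `before`/`meets`) cannot express.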