Spatially Coherent Interpretations of Videos Using Pattern Theory

Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using the domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built using basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges—appearance variabilities, errors in object labels, object clutter, simultaneous events, etc—without the need for exponentially-large (labeled) training data. Experimental results demonstrate its ability to successfully provide interpretations under clutter and the simultaneity of events. They show: (1) a performance increase of more than 30 % over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50 and 100 %, respectively, on the YouCook data set.

[1]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[2]  Krishna Sandeep Reddy Dubba Learning relational event models from videos , 2012 .

[3]  Larry S. Davis,et al.  Multi-agent event recognition in structured scenarios , 2011, CVPR 2011.

[4]  Lea Fleischer,et al.  General Pattern Theory A Mathematical Study Of Regular Structures , 2016 .

[5]  Anthony G. Cohn,et al.  Learning Relational Event Models from Video , 2015, J. Artif. Intell. Res..

[6]  Michael I. Miller,et al.  Pattern Theory: From Representation to Inference , 2007 .

[7]  Mubarak Shah,et al.  Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Benjamin Z. Yao,et al.  Unsupervised learning of event AND-OR grammar and semantics from video , 2011, 2011 International Conference on Computer Vision.

[9]  Alan Fern,et al.  Probabilistic event logic for interval-based event recognition , 2011, CVPR 2011.

[10]  Ulf Grenander,et al.  General Pattern Theory: A Mathematical Study of Regular Structures , 1993 .

[11]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Rama Chellappa,et al.  A Constrained Probabilistic Petri Net Framework for Human Activity Detection in Video* , 2008, IEEE Transactions on Multimedia.

[13]  Larry S. Davis,et al.  Representation and Recognition of Events in Surveillance Video Using Petri Nets , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[14]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Nanning Zheng,et al.  Modeling 4D Human-Object Interactions for Event and Object Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Yang Wang,et al.  Discriminative Latent Models for Recognizing Contextual Group Activities , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Jake K. Aggarwal,et al.  Robust Human-Computer Interaction System Guiding a User by Providing Feedback , 2007, IJCAI.

[19]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Rama Chellappa,et al.  Recognition of Multi-Object Events Using Attribute Grammars , 2006, 2006 International Conference on Image Processing.

[21]  Qiang Ji,et al.  Video event recognition with deep hierarchical context model , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  François Brémond,et al.  Probabilistic Recognition of Complex Event , 2011, ICVS.

[23]  Sangmin Oh,et al.  Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach , 2013, 2013 IEEE International Conference on Computer Vision.

[24]  Rama Chellappa,et al.  PADS: A Probabilistic Activity Detection Framework for Video Data , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[26]  Greg Mori,et al.  Social roles in hierarchical models for human activity recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jeffrey Mark Siskind,et al.  Seeing What You're Told: Sentence-Guided Activity Recognition in Video , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Sudeep Sarkar,et al.  Pattern Theory-Based Interpretation of Activities , 2014, 2014 22nd International Conference on Pattern Recognition.

[30]  Mohamed R. Amer,et al.  Monte Carlo Tree Search for Scheduling Activity Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[31]  Sudeep Sarkar,et al.  Temporally coherent interpretations for long videos using pattern theory , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[34]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Aaron F. Bobick,et al.  Recognition of Visual Activities and Interactions by Stochastic Parsing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[36]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[37]  Yunde Jia,et al.  Parsing video events with goal inference and intent prediction , 2011, 2011 International Conference on Computer Vision.