Generating open world descriptions of video using common sense knowledge in a pattern theory framework

Interpreting activities captured in video extends beyond recognizing the observed actions and objects. It requires open-world reasoning and the construction of deep semantic connections that go beyond what is directly observed in the video and annotated in the training data, so prior knowledge plays a central role. Grenander’s canonical pattern theory representation offers an elegant mechanism for capturing these semantic connections between what is observed directly in the image and prior knowledge held in large-scale commonsense knowledge bases such as ConceptNet. We represent interpretations as a connected structure of basic detected (grounded) concepts, such as objects and actions, that are bound by semantics to background concepts not directly observed, i.e., contextualization cues. Concepts are the basic generators, and the bonds are defined by the semantic relationships between concepts. Local and global regularity constraints govern these bonds and the overall connection structure. Our inference engine is based on energy minimization with an efficient Markov chain Monte Carlo (MCMC) sampler that uses ConceptNet in its move proposals to find the structures that describe the image content. Using four publicly available large datasets, Charades, Microsoft Visual Description Corpus (MSVD), Breakfast Actions, and CMU Kitchen, we

Received March 22, 2018, and, in revised form, October 12, 2018. 2010 Mathematics Subject Classification. Primary 54C40, 14E20; Secondary 46E25, 20C20.
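
The inference scheme described in the abstract (grounded generators, bonds taken from a knowledge base, and energy minimization by MCMC with knowledge-driven move proposals) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy `KNOWLEDGE` graph stands in for ConceptNet, the energy function and the `CONTEXT_COST` penalty are invented, and the sampler is a plain Metropolis-style loop that omits the proposal-ratio correction.

```python
import math
import random

# Hypothetical stand-in for ConceptNet: weighted, undirected concept relations.
KNOWLEDGE = {
    ("pour", "milk"): 0.8,
    ("milk", "cereal"): 0.7,
    ("cereal", "breakfast"): 0.9,
    ("milk", "breakfast"): 0.6,
    ("pour", "bowl"): 0.5,
    ("bowl", "cereal"): 0.6,
    ("bowl", "breakfast"): 0.4,
    ("pour", "garden"): 0.1,
}
CONTEXT_COST = 0.5  # assumed penalty for each ungrounded (background) generator


def bond(a, b):
    """Bond strength between two generators (0 if the toy graph has no edge)."""
    return KNOWLEDGE.get((a, b), KNOWLEDGE.get((b, a), 0.0))


def neighbors(concept):
    """Concepts linked to `concept` in the toy knowledge graph."""
    return {b if a == concept else a for a, b in KNOWLEDGE if concept in (a, b)}


def energy(grounded, context):
    """Assumed energy: detector confidences and bonds lower it, context raises it."""
    concepts = sorted(set(grounded) | set(context))
    total = -sum(grounded.values()) + CONTEXT_COST * len(context)
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            total -= bond(a, b)
    return total


def propose(grounded, context, rng):
    """Add a knowledge-graph neighbor of the structure, or drop a context concept."""
    present = set(grounded) | context
    addable = set().union(*(neighbors(c) for c in present)) - present
    moves = [("add", c) for c in sorted(addable)]
    moves += [("remove", c) for c in sorted(context)]
    if not moves:
        return context
    kind, concept = rng.choice(moves)
    return context | {concept} if kind == "add" else context - {concept}


def infer(grounded, steps=2000, temperature=0.3, seed=0):
    """Metropolis-style search; the Hastings correction is omitted for brevity."""
    rng = random.Random(seed)
    context = frozenset()
    current = energy(grounded, context)
    best, best_energy = context, current
    for _ in range(steps):
        candidate = propose(grounded, context, rng)
        cand_energy = energy(grounded, candidate)
        delta = cand_energy - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            context, current = candidate, cand_energy
            if current < best_energy:
                best, best_energy = context, current
    return set(best), best_energy


if __name__ == "__main__":
    # Grounded generators: detected concepts with hypothetical confidences.
    detections = {"pour": 0.9, "milk": 0.8}
    print(infer(detections))
```

In this toy setting the grounded generators are fixed and only the background concepts are sampled, so the low-energy structures are those in which the added context is strongly bonded to the detections and to each other. A weakly related concept such as `garden` costs more than its single bond repays.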
