Visual Commonsense for Scene Understanding Using Perception, Semantic Parsing and Reasoning

In this paper we explore the use of visual commonsense knowledge and other kinds of knowledge (such as domain knowledge, background knowledge, linguistic knowledge) for scene understanding. In particular, we combine visual processing with techniques from natural language understanding (especially semantic parsing), common-sense reasoning and knowledge representation and reasoning to improve visual perception to reason about finer aspects of activities.

[1]  Payam Saisan,et al.  Dynamic texture recognition , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[2]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Yiannis Aloimonos,et al.  Towards a Watson that sees: Language-guided action recognition for robots , 2012, 2012 IEEE International Conference on Robotics and Automation.

[4]  Yi Li,et al.  Learning shift-invariant sparse representation of actions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[6]  Antonis A. Argyros,et al.  Efficient model-based 3D tracking of hand articulations using Kinect , 2011, BMVC.

[7]  Matthai Philipose,et al.  Common Sense Based Joint Training of Human Activity Recognizers , 2007, IJCAI.

[8]  David J. Kriegman,et al.  Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Chitta Baral,et al.  Knowledge Representation, Reasoning and Declarative Problem Solving , 2003 .

[10]  Matthai Philipose,et al.  Unsupervised Activity Recognition Using Automatically Mined Common Sense , 2005, AAAI.

[11]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[12]  Douglas Summers-Stay,et al.  Using a minimal action grammar for activity understanding in the real world , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[13]  Robert P. Goldman,et al.  A Bayesian Model of Plan Recognition , 1993, Artif. Intell..

[14]  R. Vidal,et al.  Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[16]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[17]  Yiannis Aloimonos,et al.  Detection of Manipulation Action Consequences (MAC) , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Henry A. Kautz A formal theory of plan recognition , 1987 .

[19]  Yi Li,et al.  Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web , 2015, AAAI.

[20]  Shyamsundar Rajaram,et al.  Human Activity Recognition Using Multidimensional Indexing , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  Dariu Gavrila,et al.  The Visual Analysis of Human Movement: A Survey , 1999, Comput. Vis. Image Underst..

[22]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[23]  Eren Erdal Aksoy,et al.  Learning the semantics of object–action relations by observation , 2011, Int. J. Robotics Res..

[24]  Luc De Raedt,et al.  On the implementation of the probabilistic logic programming language ProbLog , 2010, Theory and Practice of Logic Programming.

[25]  Jean-Christophe Nebel,et al.  Common-Sense Knowledge for a Computer Vision System for Human Action Recognition , 2012, IWAAL.

[26]  J. Nelson Rushton,et al.  Probabilistic reasoning with answer sets , 2004, Theory and Practice of Logic Programming.

[27]  Bohyung Han,et al.  Visual Tracking by Continuous Density Propagation in Sequential Bayesian Filtering Framework , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Antonio Torralba,et al.  Inferring the Why in Images , 2014, ArXiv.

[29]  Yi Li,et al.  Extraction of parametric human model for posture recognition using genetic algorithm , 2000, Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580).

[30]  Sandra Carberry,et al.  Techniques for Plan Recognition , 2001, User Modeling and User-Adapted Interaction.

[31]  Yiannis Aloimonos,et al.  Active segmentation for robotics , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[32]  Song-Chun Zhu,et al.  Inferring "Dark Matter" and "Dark Energy" from Videos , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Song-Chun Zhu,et al.  Visual Persuasion: Inferring Communicative Intents of Images , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[35]  Yiannis Aloimonos,et al.  A Cognitive System for Understanding Human Manipulation Actions , 2014 .

[36]  Y. Aloimonos,et al.  Discovering a Language for Human Activity 1 , 2005 .

[37]  Liang Wang,et al.  Learning and Matching of Dynamic Shape Manifolds for Human Action Recognition , 2007, IEEE Transactions on Image Processing.

[38]  E. Mackin,et al.  Examination of the hand and wrist , 1998 .

[39]  Jean-Christophe Nebel,et al.  Common-sense reasoning for human action recognition , 2013, Pattern Recognit. Lett..

[40]  Rama Chellappa,et al.  Identification of humans using gait , 2004, IEEE Transactions on Image Processing.

[41]  Mubarak Shah,et al.  Actions sketch: a novel action representation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).