A Cognitive System for Understanding Human Manipulation Actions

This paper describes the architecture of a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and that includes interacting modules for perception and reasoning. Our work addresses two core problems at the heart of action understanding: (a) grounding relevant information about actions in perception (the perception-action integration problem), and (b) organizing perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Grammar, a context-free grammar that organizes an action as a sequence of sub-events. Each sub-event is described by the hand, the movements, and the objects and tools involved, and the relevant information about these factors is obtained from biologically inspired perception modules. These modules track the hands and objects, and they recognize hand grasps, objects, and actions using attention, segmentation, and feature description. Experiments on a new dataset of manipulation actions show that our system extracts the relevant visual information and semantic representation. This representation could further be used by the cognitive agent for reasoning, prediction, and planning.
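Because the Manipulation Action Grammar is context-free, a sub-event sequence can be checked with a standard CFG parser such as CYK. The sketch below is illustrative only: the toy rules and symbol names (HP for hand phrase, A for action, H/M/O for hand, movement, and object) are assumptions for demonstration, not the paper's actual grammar.

```python
# Toy CYK parser over a miniature "manipulation action grammar" in
# Chomsky normal form. Rules and symbols are illustrative assumptions,
# not the grammar defined in the paper.
from itertools import product

# Binary rules: (Y, Z) -> X, i.e. X derives Y Z
BINARY = {
    ("H", "A"): "HP",   # hand phrase = hand + action
    ("M", "O"): "A",    # action = movement + object/tool
}
# Lexical rules: terminal -> preterminal symbol
LEXICAL = {
    "hand": "H",
    "grasp": "M", "cut": "M",
    "knife": "O", "bread": "O",
}

def cyk(tokens, start="HP"):
    """Return True if `tokens` derives from `start` under the toy grammar."""
    n = len(tokens)
    # table[i][j] holds the nonterminals spanning tokens[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        if tok in LEXICAL:
            table[i][0].add(LEXICAL[tok])
    for span in range(2, n + 1):          # span length
        for i in range(n - span + 1):     # span start
            for split in range(1, span):  # split point inside the span
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for y, z in product(left, right):
                    if (y, z) in BINARY:
                        table[i][span - 1].add(BINARY[(y, z)])
    return start in table[0][n - 1]

print(cyk("hand grasp knife".split()))  # True: well-formed sub-event
print(cyk("hand knife grasp".split()))  # False: ill-formed ordering
```

In the full system, the terminals would come from the perception modules (tracked hands, recognized grasps, objects, and movements) rather than from a hand-written token list.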
