Learning the semantics of object–action relations by observation

Recognizing manipulations performed by a human, and then transferring and executing them on a robot, is a difficult problem. We address it in the current study by introducing a novel representation of the relations between objects at decisive time points during a manipulation. This representation encodes the essential changes in a visual scene in a condensed way, such that a robot can recognize and learn a manipulation without prior object knowledge. To achieve this, we continuously track image segments in the video and construct a dynamic graph sequence. Topological transitions of those graphs occur whenever a spatial relation between some segments changes in a discontinuous way, and these moments are stored in a transition matrix called the semantic event chain (SEC). We demonstrate that these time points are highly descriptive for distinguishing between different manipulations. Employing simple sub-string search algorithms, SECs can be compared, and type-similar manipulations can be recognized with high confidence. As the approach is generic, statistical learning can be used to find the archetypal SEC of a given manipulation class. The performance of the algorithm is demonstrated on a set of real videos showing hands manipulating various objects and performing different actions. In experiments with a robotic arm, we show that the SEC can be learned by observing human manipulations, transferred to a new scenario, and then reproduced by the machine.
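The core idea can be illustrated with a minimal sketch. Assume each video frame yields a tuple of discrete spatial-relation codes, one per tracked segment pair (the codes and helper names below are illustrative, not the paper's actual implementation). Compressing consecutive duplicate tuples keeps only the discontinuous transition points, which form the columns of a toy event chain; the rows (one relation string per segment pair) can then be compared with a simple substring-style similarity measure.

```python
from difflib import SequenceMatcher

def event_chain(frames):
    """Compress per-frame relation tuples, keeping only the frames
    where at least one pairwise relation changes (the 'event' columns)."""
    chain = [frames[0]]
    for f in frames[1:]:
        if f != chain[-1]:
            chain.append(f)
    return chain

def rows(chain):
    """One string per segment pair: its relation sequence across events."""
    return ["".join(str(col[i]) for col in chain)
            for i in range(len(chain[0]))]

def sec_similarity(chain_a, chain_b):
    """Mean best-match similarity between the rows of two event chains,
    using difflib.SequenceMatcher as a stand-in for sub-string search."""
    sims = [max(SequenceMatcher(None, ra, rb).ratio()
                for rb in rows(chain_b))
            for ra in rows(chain_a)]
    return sum(sims) / len(sims)

# Toy example: codes 1 = not touching, 2 = touching, for two segment pairs
# (hand/object, object/table) over six frames of a pushing-like action.
frames = [(1, 2), (1, 2), (2, 2), (2, 2), (2, 2), (1, 2)]
chain = event_chain(frames)  # three event columns: (1,2) -> (2,2) -> (1,2)
```

Type-similar manipulations produce near-identical relation strings even when object appearance differs, which is why a string-matching score suffices for recognition here.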
