Human intention understanding based on object affordance and action classification

Intention understanding is a basic requirement for human-machine interaction. Action classification and object affordance recognition are two possible ways to understand human intention. In this study, a Multiple Timescale Recurrent Neural Network (MTRNN) is adapted to analyze human action. A supervised MTRNN, which is an extension of the Continuous Time Recurrent Neural Network (CTRNN), is used for action and intention classification. Deep learning algorithms, meanwhile, have proven effective at understanding complex concepts in complex real-world environments. A stacked denoising autoencoder (SDA) is used to extract implicit intention-related information from the observed objects, and a feature-based object detection method, Speeded-Up Robust Features (SURF), is used to obtain object information. Object affordance describes the interactions between an agent and its environment. In this paper, we propose an intention recognition system that combines action classification with object affordance information. Experimental results show that the supervised MTRNN is able to exploit different information in different time periods and improves the intention recognition rate by cooperating with the SDA.
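The multiple-timescale idea behind the MTRNN can be sketched as a CTRNN whose units carry different time constants, so that fast units track rapid input changes while slow units integrate longer-range context. The following minimal NumPy sketch is an illustration under assumed sizes, weights, and tau values, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minimal MTRNN-style network: two neuron groups with fast
# and slow time constants (tau). Sizes and scales are illustrative only.
n_fast, n_slow = 8, 4
n = n_fast + n_slow
tau = np.concatenate([np.full(n_fast, 2.0), np.full(n_slow, 20.0)])
W = rng.normal(scale=0.1, size=(n, n))   # random recurrent weights
u = np.zeros(n)                          # internal (membrane) states

def step(u, x):
    """One discrete-time CTRNN update:
    u_{t+1} = (1 - 1/tau) * u_t + (1/tau) * (W y_t + x),  y = tanh(u)."""
    y = np.tanh(u)
    return (1.0 - 1.0 / tau) * u + (W @ y + x) / tau

x = np.zeros(n)
x[:n_fast] = 1.0                         # drive only the fast group
for _ in range(50):
    u = step(u, x)
# The slow group, with its large tau, changes state gradually, which is
# what lets it encode slowly varying context such as an overall intention,
# while the fast group follows the moment-to-moment action input.
```

In the full MTRNN, the weights are trained (here they are random), and the slow context units' final states can serve as a compact representation of the observed action sequence for classification.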
