Learning social affordance grammar from videos: Transferring human interactions to human-robot interactions

In this paper, we present a general framework for learning a social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and for transferring the grammar to humanoids to enable real-time motion inference for human-robot interaction (HRI). Based on Gibbs sampling, our weakly supervised grammar learning automatically constructs a hierarchical representation of an interaction, with long-term joint sub-tasks of both agents and short-term atomic actions of individual agents. On a new RGB-D video dataset with rich instances of human interactions, our experiments on Baxter simulation, human evaluation, and a real Baxter test demonstrate that the model learned from limited training data successfully generates human-like behaviors in unseen scenarios and outperforms two baseline methods.
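The central structure described above is the ST-AOG hierarchy: an interaction decomposes into long-term joint sub-tasks shared by both agents, and each sub-task decomposes into the agents' short-term atomic actions. As a rough illustration only (all class and field names below are hypothetical and not the paper's actual representation), such a hierarchy could be sketched as:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of an ST-AOG-style hierarchy. An interaction is an
# OR-choice over alternative sub-task sequences; each joint sub-task is an
# AND-composition of the two agents' atomic-action segments.

@dataclass
class AtomicAction:
    agent: str   # e.g., "human" or "robot" (illustrative labels)
    label: str   # e.g., "approach", "reach", "retract"
    start: int   # start frame of the segment
    end: int     # end frame of the segment

@dataclass
class JointSubTask:
    name: str
    agent1_actions: List[AtomicAction] = field(default_factory=list)
    agent2_actions: List[AtomicAction] = field(default_factory=list)

@dataclass
class InteractionGrammar:
    # OR-node: alternative sub-task sequences observed across training videos
    sub_task_sequences: List[List[JointSubTask]] = field(default_factory=list)
```

In the weakly supervised setting, the assignment of atomic-action segments to joint sub-tasks is latent and would be resampled (e.g., by Gibbs sampling over segment-to-sub-task labels) rather than annotated.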
