Teaching Robots to Predict Human Motion

Teaching a robot to predict and mimic how a human will move or act in the near future, by observing a sequence of past human movements, is a crucial first step toward human-robot interaction and collaboration. In this paper, we equip a robot with this predictive ability by leveraging recent deep learning and computer vision techniques. First, our system takes images from the robot's camera and extracts the corresponding human skeleton using real-time pose estimation from the OpenPose library. Conditioned on this observed sequence, a motion predictor then forecasts plausible future motion, which the robot replays as a demonstration. Because existing forecasting algorithms lack high-level fidelity validation, they suffer from error accumulation and inaccurate predictions. Inspired by generative adversarial networks (GANs), we introduce a global discriminator that examines whether the predicted sequence is smooth and realistic. The resulting motion GAN model achieves superior prediction performance compared with state-of-the-art approaches on the standard H3.6M dataset. Built on this motion GAN model, the robot demonstrates that it can replay predicted motion in a human-like manner while interacting with a person.
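
As a rough illustration of the pipeline described above, the sketch below pairs a recurrent motion predictor with a global sequence discriminator. It assumes PyTorch, pose sequences stored as (batch, frames, 54)-dimensional joint tensors, and an L1 reconstruction term alongside the adversarial loss; the module names (MotionPredictor, GlobalDiscriminator, train_step) and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal motion-GAN sketch (assumed PyTorch implementation, not the paper's code).
import torch
import torch.nn as nn

class MotionPredictor(nn.Module):
    """GRU encoder-decoder that forecasts future poses from an observed sequence."""
    def __init__(self, pose_dim=54, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(pose_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, observed, future_len):
        _, h = self.encoder(observed)          # summarize the observed history
        h = h.squeeze(0)
        pose = observed[:, -1]                 # start from the last observed pose
        preds = []
        for _ in range(future_len):            # autoregressive rollout
            h = self.decoder(pose, h)
            pose = pose + self.out(h)          # predict residual motion per frame
            preds.append(pose)
        return torch.stack(preds, dim=1)

class GlobalDiscriminator(nn.Module):
    """Scores whether a full (observed + predicted) sequence looks smooth and realistic."""
    def __init__(self, pose_dim=54, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, sequence):
        _, h = self.rnn(sequence)
        return torch.sigmoid(self.score(h.squeeze(0)))

def train_step(G, D, opt_g, opt_d, observed, future, bce=nn.BCELoss()):
    batch = observed.size(0)
    # Discriminator: real full sequences vs. sequences whose future is predicted.
    fake_future = G(observed, future.size(1)).detach()
    real_seq = torch.cat([observed, future], dim=1)
    fake_seq = torch.cat([observed, fake_future], dim=1)
    d_loss = bce(D(real_seq), torch.ones(batch, 1)) + \
             bce(D(fake_seq), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the ground truth.
    pred = G(observed, future.size(1))
    g_loss = bce(D(torch.cat([observed, pred], dim=1)), torch.ones(batch, 1)) + \
             nn.functional.l1_loss(pred, future)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The key design point this sketch tries to capture is that the discriminator always sees the observed frames concatenated with the predicted ones, so it can penalize discontinuities at the boundary between history and forecast, which is the role the global discriminator plays in the description above.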
