Adversarial Geometry-Aware Human Motion Prediction

We explore an approach to forecasting human motion in a few milliseconds given an input 3D skeleton sequence based on a recurrent encoder-decoder framework. Current approaches suffer from the problem of prediction discontinuities and may fail to predict human-like motion in longer time horizons due to error accumulation. We address these critical issues by incorporating local geometric structure constraints and regularizing predictions with plausible temporal smoothness and continuity from a global perspective. Specifically, rather than using the conventional Euclidean loss, we propose a novel frame-wise geodesic loss as a geometrically meaningful, more precise distance measurement. Moreover, inspired by the adversarial training mechanism, we present a new learning procedure to simultaneously validate the sequence-level plausibility of the prediction and its coherence with the input sequence by introducing two global recurrent discriminators. An unconditional, fidelity discriminator and a conditional, continuity discriminator are jointly trained along with the predictor in an adversarial manner. Our resulting adversarial geometry-aware encoder-decoder (AGED) model significantly outperforms state-of-the-art deep learning based approaches on the heavily benchmarked H3.6M dataset in both short-term and long-term predictions.

[1]  G. A. Einicke,et al.  Smoothing, Filtering and Prediction - Estimating The Past, Present and Future , 2012 .

[2]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[3]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[4]  Martial Hebert,et al.  The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Jiajun Wu,et al.  Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks , 2016, NIPS.

[6]  Geoffrey E. Hinton,et al.  Factored conditional restricted Boltzmann Machines for modeling motion style , 2009, ICML '09.

[7]  Manuela M. Veloso,et al.  Towards a Robust Interactive and Learning Social Robot , 2018, AAMAS.

[8]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[9]  Geoffrey E. Hinton,et al.  The Recurrent Temporal Restricted Boltzmann Machine , 2008, NIPS.

[10]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Otmar Hilliges,et al.  Learning Human Motion Models for Long-Term Predictions , 2017, 2017 International Conference on 3D Vision (3DV).

[12]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[13]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[14]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[15]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[16]  José M. F. Moura,et al.  Teaching Robots to Predict Human Motion , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[17]  José M. F. Moura,et al.  Few-Shot Human Motion Prediction via Meta-learning , 2018, ECCV.

[18]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.

[19]  Kris M. Kitani,et al.  Action-Reaction: Forecasting the Dynamics of Human Interaction , 2014, ECCV.

[20]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ye Yuan,et al.  Review Networks for Caption Generation , 2016, NIPS.

[22]  Aaron Hertzmann,et al.  Style machines , 2000, SIGGRAPH 2000.

[23]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[24]  Hema Swetha Koppula,et al.  Learning Spatio-Temporal Structure from RGB-D Videos for Human Activity Detection and Anticipation , 2013, ICML.

[25]  Ronald J. Williams,et al.  A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[26]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Hongdong Li,et al.  Rotation Averaging , 2013, International Journal of Computer Vision.

[28]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Richard S. Zemel,et al.  Exploring Models and Data for Image Question Answering , 2015, NIPS.

[30]  Lucas Kovar,et al.  Motion graphs , 2002, SIGGRAPH Classes.

[31]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[32]  Zicheng Liu,et al.  HP-GAN: Probabilistic 3D Human Motion Prediction via GAN , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[33]  David J. Fleet,et al.  Dynamical binary latent variable models for 3D human pose tracking , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[34]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[35]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[36]  F. Sebastian Grassia,et al.  Practical Parameterization of Rotations Using the Exponential Map , 1998, J. Graphics, GPU, & Game Tools.

[37]  Emilio Frazzoli,et al.  A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles , 2016, IEEE Transactions on Intelligent Vehicles.

[38]  Richard M. Murray,et al.  A Mathematical Introduction to Robotic Manipulation , 1994 .

[39]  Danica Kragic,et al.  Deep Representation Learning for Human Motion Prediction and Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Yaser Sheikh,et al.  Bilinear spatiotemporal basis models , 2012, TOGS.

[41]  Tamara L. Berg,et al.  Learning Temporal Transformations from Time-Lapse Videos , 2016, ECCV.

[42]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[43]  David J. Fleet,et al.  Topologically-constrained latent variable models , 2008, ICML '08.

[44]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[45]  Honglak Lee,et al.  Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.

[46]  P J Webros BACKPROPAGATION THROUGH TIME: WHAT IT DOES AND HOW TO DO IT , 1990 .

[47]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[48]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[49]  B. Schölkopf,et al.  Modeling Human Motion Using Binary Latent Variables , 2007 .

[50]  Xavier Pennec,et al.  A Framework for Uncertainty and Validation of 3-D Registration Methods Based on Points and Frames , 2004, International Journal of Computer Vision.

[51]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Eric P. Xing,et al.  Dual Motion GAN for Future-Flow Embedded Video Prediction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  W. Rossmann Lie Groups: An Introduction through Linear Groups , 2002 .

[55]  Du Q. Huynh,et al.  Metrics for 3D Rotations: Comparison and Analysis , 2009, Journal of Mathematical Imaging and Vision.

[56]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[57]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[58]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[59]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[60]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).