Long-Term Human Video Generation of Multiple Futures Using Poses

Predicting future human behavior from an input human video is a useful task for applications such as autonomous driving and robotics. While most previous works predict a single future, multiple futures with different behavior can potentially occur. Moreover, if the predicted future is too short (e.g., less than one second), it may not be fully usable by a human or other systems. In this paper, we propose a novel method for future human pose prediction capable of predicting multiple long-term futures. This makes the predictions more suitable for real applications. Also, from the input video and the predicted human behavior, we generate future videos. First, from an input human video, we generate sequences of future human poses (i.e., the image coordinates of their body-joints) via adversarial learning. Adversarial learning suffers from mode collapse, which makes it difficult to generate a variety of multiple poses. We solve this problem by utilizing two additional inputs to the generator to make the outputs diverse, namely, a latent code (to reflect various behaviors) and an attraction point (to reflect various trajectories). In addition, we generate long-term future human poses using a novel approach based on unidimensional convolutional neural networks. Last, we generate an output video based on the generated poses for visualization. We evaluate the generated future poses and videos using three criteria (i.e., realism, diversity and accuracy), and show that our proposed method outperforms other state-of-the-art works.

[1]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[2]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[4]  Danica Kragic,et al.  Deep Representation Learning for Human Motion Prediction and Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Chi-Keung Tang,et al.  Deep Video Generation, Prediction and Completion of Human Action Sequences , 2017, ECCV.

[6]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[7]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[8]  Yaser Sheikh,et al.  Bilinear spatiotemporal basis models , 2012, TOGS.

[9]  Alex Graves,et al.  Neural Machine Translation in Linear Time , 2016, ArXiv.

[10]  Aaron Hertzmann,et al.  Style machines , 2000, SIGGRAPH 2000.

[11]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[12]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[13]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[14]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[15]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[17]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[18]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[19]  Martial Hebert,et al.  The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  Tatsuya Harada,et al.  Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture , 2017, AAAI.

[22]  Sergey Levine,et al.  Stochastic Adversarial Video Prediction , 2018, ArXiv.

[23]  José M. F. Moura,et al.  Adversarial Geometry-Aware Human Motion Prediction , 2018, ECCV.

[24]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Ruben Villegas,et al.  Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[26]  Bingbing Ni,et al.  Skeleton-Aided Articulated Motion Generation , 2017, ACM Multimedia.

[27]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[28]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[29]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .