DLow: Diversifying Latent Flows for Diverse Human Motion Prediction

Deep generative models are often used for human motion prediction because they can model multi-modal data distributions and characterize diverse human behavior. While much care has gone into designing and training deep generative models, how to efficiently produce diverse samples from a model after it has been trained remains under-explored. To obtain samples from a pretrained generative model, most existing generative human motion prediction methods draw a set of independent Gaussian latent codes and convert them to motion samples. This random sampling strategy is not guaranteed to produce diverse samples for two reasons: (1) independent sampling cannot force the samples to be diverse; (2) the sampling is based solely on likelihood, so it may only produce samples that correspond to the major modes of the data distribution. To address these problems, we propose a novel sampling method, Diversifying Latent Flows (DLow), to produce a diverse set of samples from a pretrained deep generative model. Unlike random (independent) sampling, DLow samples a single random variable and then maps it, through a set of learnable mapping functions, to a set of correlated latent codes. The correlated latent codes are then decoded into a set of correlated samples. During training, DLow uses a diversity-promoting prior over samples as an objective to optimize the latent mappings and improve sample diversity. The design of the prior is highly flexible and can be customized to generate diverse motions with common features (e.g., similar leg motion but diverse upper-body motion). Our experiments demonstrate that DLow outperforms state-of-the-art baseline methods in terms of sample diversity and accuracy. Our code is released on the project page.
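The contrast between independent sampling and the DLow-style sampling described above can be illustrated with a minimal sketch. It assumes the mapping functions are affine, z_k = A_k ε + b_k, which is an assumption made here for illustration; the abstract only says "learnable mapping functions". The names `dlow_sample`, `independent_sample`, `A`, and `b` are hypothetical, and the pretrained decoder and the diversity-promoting training objective are left out.

```python
import numpy as np


def independent_sample(K, d, rng):
    """Baseline: K latent codes drawn independently from N(0, I)."""
    return rng.standard_normal((K, d))


def dlow_sample(A, b, rng):
    """Map ONE shared noise vector to K correlated latent codes.

    A: (K, d, d) per-code linear maps (hypothetical stand-ins for learned weights)
    b: (K, d)    per-code offsets    (hypothetical stand-ins for learned weights)
    """
    d = A.shape[-1]
    eps = rng.standard_normal(d)  # the single random variable shared by all codes
    return A @ eps + b            # shape (K, d): z_k = A_k @ eps + b_k


rng = np.random.default_rng(0)
K, d = 4, 3
A = np.stack([np.eye(d)] * K)             # identity maps, for illustration only
b = np.arange(K * d, dtype=float).reshape(K, d)
z = dlow_sample(A, b, rng)                # K correlated latent codes
```

Because every code is a function of the same ε, the spread between codes is controlled by the mappings rather than by chance: with the identity maps above, the difference between any two codes equals the difference of their offsets for every draw of ε. In the method described in the abstract, the mappings would be optimized so that the decoded samples are diverse.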
