MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory Prediction

Accurate long-term trajectory prediction in complex scenes, where multiple agents (e.g., pedestrians or vehicles) interact with each other and the environment while pursuing diverse and often unknown goals, is a challenging stochastic forecasting problem. In this work, we propose MUSE-VAE, a new probabilistic modeling framework based on a cascade of Conditional VAEs, which tackles long-term, uncertain trajectory prediction with a coarse-to-fine, multi-factor forecasting architecture. In its Macro stage, the model learns a joint pixel-space representation of two key factors, the underlying environment and the agent movements, to predict the long- and short-term motion goals. Conditioned on these goals, the Micro stage learns a fine-grained spatio-temporal representation for predicting individual agent trajectories. The VAE backbones of the two stages make it possible to naturally account for the joint uncertainty at both levels of granularity. As a result, MUSE-VAE offers diverse and simultaneously more accurate predictions than the current state of the art. We demonstrate these claims through a comprehensive set of experiments on the nuScenes and SDD benchmarks as well as PFSD, a new synthetic dataset that challenges the forecasting ability of models in complex agent-environment interaction scenarios.
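The coarse-to-fine sampling order described above can be illustrated with a toy sketch. This is not the MUSE-VAE model: it contains no neural networks, and every name in it (`sample_goal`, `sample_trajectory`, `predict`, `latent_scale`) is hypothetical. It only assumes that a macro step draws a goal from a pixel-space probability map and that a micro step then draws a latent-perturbed trajectory conditioned on that goal.

```python
import numpy as np


def sample_goal(heatmap, rng):
    """Macro-stage stand-in: draw one (row, col) goal cell from a 2D probability map."""
    probs = heatmap.ravel() / heatmap.sum()
    idx = rng.choice(probs.size, p=probs)
    return np.array(np.unravel_index(idx, heatmap.shape), dtype=float)


def sample_trajectory(start, goal, horizon, latent_scale, rng):
    """Micro-stage stand-in: a goal-conditioned path perturbed by a Gaussian latent."""
    # One 2D latent offset per future step (assumption: isotropic Gaussian).
    z = rng.normal(scale=latent_scale, size=(horizon, 2))
    # Fractions of the way to the goal, ending exactly at 1.0.
    t = np.linspace(0.0, 1.0, horizon + 1)[1:, None]
    skeleton = start + t * (goal - start)
    # Taper the perturbation to (numerically) zero at the final step,
    # so the sampled path still ends at the sampled goal.
    taper = np.sin(np.pi * t)
    return skeleton + taper * z


def predict(start, heatmap, horizon=12, num_samples=5, seed=0):
    """Draw several futures: goal first (coarse), then trajectory (fine)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_samples):
        goal = sample_goal(heatmap, rng)
        samples.append(sample_trajectory(start, goal, horizon, 0.5, rng))
    return np.stack(samples)  # shape: (num_samples, horizon, 2)
```

Because uncertainty enters at both steps, repeated calls yield paths that differ in where they end as well as in how they get there, which is the qualitative behavior the two-stage design aims for.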
