Latent Variable Nested Set Transformers & AutoBots

Polytechnique Montréal; Mila; Element AI / ServiceNow; Independent Robotics; Algolux; École de technologie supérieure; McGill University; Canada CIFAR AI Chair; Princeton University. Correspondence to: Roger Girgis <roger.girgis@gmail.com>.

Humans have the innate ability to attend to the most relevant actors in their vicinity and to forecast how those actors may behave in the future. This ability is crucial for deploying safety-critical agents, such as robots or autonomous vehicles, that interact with humans. We propose a theoretical framework for this problem setting based on autoregressively modelling sequences of nested sets, using latent variables to better capture multimodal distributions over future sets of sets. We present a new model architecture, which we call the Nested Set Transformer, that employs multi-head self-attention blocks over sets of sets; these blocks act as a form of social attention between the elements of the sets at every timestep. Our approach can produce a distribution over future trajectories for all agents under consideration, or focus on the trajectory of a single ego-agent. We validate the Nested Set Transformer in the autonomous driving setting, where we refer to it as "AutoBot" and model the trajectory of an ego-agent based on sequential observations of key attributes of multiple agents in a scene. AutoBot outperforms published state-of-the-art results on the challenging nuScenes vehicle trajectory prediction benchmark. We also examine a multi-agent variant of our model that jointly forecasts the ego-agent's future trajectory along with those of the other agents in the scene, and we validate the behavior of our proposed Nested Set Transformer for scene-level forecasting on a pedestrian trajectory dataset.
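To make the "attention over sets of sets" idea concrete, the sketch below shows one plausible way to alternate social attention (agents attending to each other within a timestep) with temporal attention (each agent attending over its own history). This is a minimal illustrative example in PyTorch under assumed tensor shapes and module names; it is not the authors' actual AutoBot implementation.

```python
# Illustrative sketch only (assumed shapes and names), not the paper's exact architecture:
# attention over a sequence of sets of agents, factored into social and temporal attention.
import torch
import torch.nn as nn


class SocialTemporalBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.social_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, agents, d_model) -- a sequence (time) of sets (agents).
        B, T, A, D = x.shape

        # Social attention: agents attend to one another within each timestep.
        s = x.reshape(B * T, A, D)
        s_out, _ = self.social_attn(s, s, s)
        x = self.norm1(x + s_out.reshape(B, T, A, D))

        # Temporal attention: each agent attends over its own history.
        t = x.permute(0, 2, 1, 3).reshape(B * A, T, D)
        t_out, _ = self.temporal_attn(t, t, t)
        t_out = t_out.reshape(B, A, T, D).permute(0, 2, 1, 3)
        return self.norm2(x + t_out)


if __name__ == "__main__":
    block = SocialTemporalBlock()
    states = torch.randn(2, 10, 5, 64)  # 2 scenes, 10 timesteps, 5 agents
    print(block(states).shape)          # torch.Size([2, 10, 5, 64])
```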
