Motion Transformer with Global Intention Localization and Local Movement Refinement

Predicting the multimodal future behavior of traffic participants is essential for robotic vehicles to make safe decisions. Existing works either directly predict future trajectories from latent features or use dense goal candidates to identify agents' destinations. The former strategy converges slowly because all motion modes are derived from the same feature, while the latter is inefficient because its performance relies heavily on the density of goal candidates. In this paper, we propose the Motion TRansformer (MTR) framework, which models motion prediction as the joint optimization of global intention localization and local movement refinement. Instead of using goal candidates, MTR incorporates spatial intention priors by adopting a small set of learnable motion query pairs. Each motion query pair is responsible for trajectory prediction and refinement for a specific motion mode, which stabilizes training and facilitates better multimodal predictions. Experiments show that MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges, ranking 1st on the leaderboards of the Waymo Open Motion Dataset. The source code is available at https://github.com/sshaoshuai/MTR.
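The idea of one motion query pair per motion mode can be illustrated with a small sketch. This is not the MTR implementation: the class `MotionQueryPair`, the helper functions, the 2D intention points, and the nearest-endpoint rule for choosing which query is trained on a sample are all assumptions made for illustration. The static part of each pair stays tied to one spatial intention, while the dynamic part follows the latest refined prediction.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class MotionQueryPair:
    """Hypothetical container for one motion mode (names are illustrative)."""
    intention_point: Point  # static part: fixed spatial prior for this mode
    dynamic_point: Point    # dynamic part: follows the refined prediction


def init_queries(intention_points: List[Point]) -> List[MotionQueryPair]:
    # Each query starts with its dynamic part at its own intention point.
    return [MotionQueryPair(p, p) for p in intention_points]


def refine_step(query: MotionQueryPair, predicted_endpoint: Point) -> None:
    # One refinement layer, abstracted: only the dynamic part moves.
    query.dynamic_point = predicted_endpoint


def select_positive_mode(queries: List[MotionQueryPair], gt_endpoint: Point) -> int:
    # Assumed assignment rule: the query whose static intention point is
    # closest to the ground-truth endpoint is responsible for this sample.
    return min(
        range(len(queries)),
        key=lambda k: math.dist(queries[k].intention_point, gt_endpoint),
    )


queries = init_queries([(10.0, 0.0), (0.0, 10.0), (-10.0, 0.0)])
positive = select_positive_mode(queries, gt_endpoint=(8.0, 1.5))
refine_step(queries[positive], predicted_endpoint=(8.5, 1.0))
```

Because the assignment in this sketch depends only on the fixed intention points and not on the current predictions, each query keeps serving the same mode throughout training, which is one way to read the stability claim in the abstract.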
