RAIN: Reinforced Hybrid Attention Inference Network for Motion Forecasting

Motion forecasting plays a significant role in various domains (e.g., autonomous driving, human-robot interaction), which aims to predict future motion sequences given a set of historical observations. However, the observed elements may be of different levels of importance. Some information may be irrelevant or even distracting to the forecasting in certain situations. To address this issue, we propose a generic motion forecasting framework (named RAIN) with dynamic key information selection and ranking based on a hybrid attention mechanism. The general framework is instantiated to handle multi-agent trajectory prediction and human motion forecasting tasks, respectively. In the former task, the model learns to recognize the relations between agents with a graph representation and to determine their relative significance. In the latter task, the model learns to capture the temporal proximity and dependency in longterm human motions. We also propose an effective doublestage training pipeline with an alternating training strategy to optimize the parameters in different modules of the framework. We validate the framework on both synthetic simulations and motion forecasting benchmarks in different domains, demonstrating that our method not only achieves state-of-the-art forecasting performance, but also provides interpretable and reasonable hybrid attention weights.

[1]  Dieter Fox,et al.  Causal Discovery in Physical Systems from Videos , 2020, NeurIPS.

[2]  Yanfeng Wang,et al.  Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Kristian Kersting,et al.  Structured Object-Aware Physics Prediction for Video Modeling and Planning , 2020, ICLR.

[4]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Chengqi Zhang,et al.  Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling , 2018, IJCAI.

[6]  Li Zhao,et al.  Attention-based LSTM for Aspect-level Sentiment Classification , 2016, EMNLP.

[7]  Razvan Pascanu,et al.  Interaction Networks for Learning about Objects, Relations and Physics , 2016, NIPS.

[8]  Wei Liu,et al.  Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamic , 2018, IJCAI.

[9]  Behzad Dariush,et al.  Looking to Relations for Future Trajectory Forecast , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Razvan Pascanu,et al.  Visual Interaction Networks: Learning a Physics Simulator from Video , 2017, NIPS.

[11]  Massimiliano Pontil,et al.  Learning Discrete Structures for Graph Neural Networks , 2019, ICML.

[12]  R. Zemel,et al.  Neural Relational Inference for Interacting Systems , 2018, ICML.

[13]  Yi Pan,et al.  Multi-Horizon Time Series Forecasting with Temporal Attention Learning , 2019, KDD.

[14]  Chiho Choi,et al.  Shared Cross-Modal Trajectory Prediction for Autonomous Driving , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[16]  Roger Zimmermann,et al.  Towards Natural and Accurate Future Motion Prediction of Humans and Animals , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Wenjun Zeng,et al.  Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection , 2018, IEEE Transactions on Image Processing.

[18]  Boyu Wang,et al.  Active Vision for Early Recognition of Human Actions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Masayoshi Tomizuka,et al.  Conditional Generative Neural System for Probabilistic Trajectory Prediction , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[21]  Elena Corina Grigore,et al.  CoverNet: Multimodal Behavior Prediction Using Trajectory Sets , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Wenhao Wu,et al.  Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Philip H. S. Torr,et al.  DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Silvio Savarese,et al.  Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks , 2019, NeurIPS.

[25]  Jakob Uszkoreit,et al.  A Decomposable Attention Model for Natural Language Inference , 2016, EMNLP.

[26]  Mathieu Salzmann,et al.  History Repeats Itself: Human Motion Prediction via Motion Attention , 2020, ECCV.

[27]  Shuiwang Ji,et al.  Graph Representation Learning via Hard and Channel-Wise Attention Networks , 2019, KDD.

[28]  Abduallah A. Mohamed,et al.  Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Srikanth Malla,et al.  LOKI: Long Term and Key Intentions for Trajectory Prediction , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Zhen Zhang,et al.  Convolutional Sequence to Sequence Model for Human Dynamics , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[32]  Garrison W. Cottrell,et al.  A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction , 2017, IJCAI.

[33]  David Silver,et al.  Deep Reinforcement Learning with Double Q-Learning , 2015, AAAI.

[34]  Mohan M. Trivedi,et al.  Convolutional Social Pooling for Vehicle Trajectory Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[35]  Masayoshi Tomizuka,et al.  Spatio-Temporal Graph Dual-Attention Network for Multi-Agent Prediction and Tracking , 2021, IEEE Transactions on Intelligent Transportation Systems.

[36]  David Isele,et al.  Reinforcement Learning for Autonomous Driving with Latent State Inference and Spatial-Temporal Relationships , 2020, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[37]  Yedid Hoshen,et al.  VAIN: Attentional Multi-agent Predictive Modeling , 2017, NIPS.

[38]  Hongdong Li,et al.  Learning Trajectory Dependencies for Human Motion Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Dimitris N. Metaxas,et al.  MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Zhaoxin Li,et al.  STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Jean Oh,et al.  Social Attention: Modeling Attention in Human Crowds , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[42]  Ming-Hsuan Yang,et al.  PiCANet: Learning Pixel-Wise Contextual Attention for Saliency Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Alexander Schwing,et al.  Dynamic Neural Relational Inference , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  M. Tomizuka,et al.  EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning , 2020, NeurIPS.

[45]  Ruslan Salakhutdinov,et al.  Multiple Futures Prediction , 2019, NeurIPS.

[46]  Benjamin Sapp,et al.  MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction , 2019, CoRL.

[47]  Masayoshi Tomizuka,et al.  Continual Multi-Agent Interaction Behavior Prediction With Conditional Generative Memory , 2021, IEEE Robotics and Automation Letters.

[48]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[49]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[51]  C. Lee Giles,et al.  A Neural Temporal Model for Human Motion Prediction , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[53]  Dragomir Anguelov,et al.  VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Tieniu Tan,et al.  An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Marco Pavone,et al.  Trajectron++: Multi-Agent Generative Trajectory Forecasting With Heterogeneous Data for Control , 2020, ArXiv.

[56]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[57]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Hassan Foroosh,et al.  Self-Attention Network for Skeleton-based Human Action Recognition , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[60]  Ting Zhao,et al.  Pyramid Feature Attention Network for Saliency Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jürgen Schmidhuber,et al.  Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions , 2018, ICLR.

[62]  Masayoshi Tomizuka,et al.  Multi-Agent Driving Behavior Prediction across Different Scenarios with Self-Supervised Domain Knowledge , 2021, 2021 IEEE International Intelligent Transportation Systems Conference (ITSC).

[63]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).