STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction

Human trajectory prediction is both challenging and critical for applications such as autonomous vehicles and social robots. Because pedestrian movement is continuous and anticipatory, pedestrians in crowded spaces take both spatial and temporal interactions into account to avoid future collisions. However, most existing methods ignore the temporal correlations of a pedestrian's interactions with the others in a scene. In this work, we propose the Spatial-Temporal Graph Attention network (STGAT), which builds on a sequence-to-sequence architecture to predict future pedestrian trajectories. In addition to the spatial interactions captured by a graph attention mechanism at each time step, we adopt an extra LSTM to encode the temporal correlations of those interactions. In comparisons with state-of-the-art methods, our model achieves superior performance on two publicly available crowd datasets (ETH and UCY) and produces more "socially" plausible trajectories for pedestrians.
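
To make the two-stage idea concrete, the following is a minimal NumPy sketch of spatial attention at each time step followed by a recurrent temporal encoder. It is an illustration under stated assumptions, not the authors' implementation: the function names (`graph_attention`, `lstm_step`, `encode_interactions`), the fully connected scene graph, the single attention head, the random weights, and all dimensions are hypothetical choices made only for this example.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_attention(h, W, a):
    """Single-head graph attention over a fully connected scene (assumed).

    h: (N, F) per-pedestrian features at one time step.
    W: (F, D) shared linear map; a: (2 * D,) attention vector.
    Returns (N, D) aggregated features and (N, N) attention weights.
    """
    z = h @ W
    d = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into the two halves of a
    e = leaky_relu((z @ a[:d])[:, None] + (z @ a[d:])[None, :])
    e = e - e.max(axis=1, keepdims=True)   # numerically stable softmax
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step for all pedestrians; gates ordered i, f, g, o."""
    gates = x @ Wx + h @ Wh + b
    i, f, g, o = np.split(gates, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def encode_interactions(features, params, hidden):
    """features: (T, N, F) -> (N, hidden) temporal summary of interactions."""
    _, n, _ = features.shape
    h = np.zeros((n, hidden))
    c = np.zeros((n, hidden))
    for step in features:
        # Spatial stage: who attends to whom at this time step
        g, _ = graph_attention(step, params["W"], params["a"])
        # Temporal stage: carry the interaction features across time
        h, c = lstm_step(g, h, c, params["Wx"], params["Wh"], params["b"])
    return h

# Illustrative sizes only: T steps, N pedestrians, F input features
rng = np.random.default_rng(0)
T, N, F, D, H = 8, 3, 4, 6, 16
params = {
    "W": rng.normal(scale=0.1, size=(F, D)),
    "a": rng.normal(scale=0.1, size=2 * D),
    "Wx": rng.normal(scale=0.1, size=(D, 4 * H)),
    "Wh": rng.normal(scale=0.1, size=(H, 4 * H)),
    "b": np.zeros(4 * H),
}
features = rng.normal(size=(T, N, F))
_, alpha = graph_attention(features[0], params["W"], params["a"])
summary = encode_interactions(features, params, hidden=H)
print(summary.shape)  # (3, 16)
```

Each row of `alpha` sums to one, so every pedestrian's aggregated feature is a convex combination of the projected features of all pedestrians in the scene. The final `summary` holds one vector per pedestrian; in a sequence-to-sequence setup, a decoder would consume such a vector to generate future positions.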
