A spatial-temporal attention model for human trajectory prediction

Human trajectory prediction is essential to many applications. The task is challenging because human behavior is uncertain: a person's path is influenced not only by their own intent but also by the surrounding environment. Recent works based on long short-term memory (LSTM) models have brought substantial improvements to trajectory prediction. However, most of them focus on the spatial influence among humans and ignore the temporal influence. In this paper, we propose a novel spatial-temporal attention (ST-Attention) model, which studies spatial and temporal affinities jointly. Specifically, we introduce an attention mechanism to extract temporal affinity, learning the importance of historical trajectory information at different time instants. To explore spatial affinity, a deep neural network is employed to measure the relative importance of each neighbor. Experimental results show that our method achieves competitive performance compared with state-of-the-art methods on publicly available datasets.
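As a rough illustration of the temporal-affinity idea, the sketch below weights a pedestrian's past hidden states by a softmax over per-time-step scores and sums them into a single context vector. It is a minimal, generic attention example and not the ST-Attention model itself: the dot-product scoring function, the `query` vector, and the names `temporal_attention`, `hidden_states`, and `context` are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, query):
    """Weight each past time step and pool them into one context vector.

    hidden_states: list of T vectors (one per observed time instant),
                   e.g. LSTM outputs; hypothetical input for this sketch.
    query:         vector of the same dimension; here it stands in for
                   whatever learned scoring parameters a real model uses.
    """
    # Assumed scoring function: dot product between each state and the query.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Softmax turns scores into importance weights over time instants.
    weights = softmax(scores)
    # Context vector: importance-weighted sum of the historical states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)
    ]
    return weights, context


# Toy usage with random numbers in place of real encoder outputs.
random.seed(0)
T, D = 8, 4  # 8 observed time steps, 4-dimensional hidden state
hidden_states = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
query = [random.gauss(0.0, 1.0) for _ in range(D)]
weights, context = temporal_attention(hidden_states, query)
```

The weights sum to one across the observed time steps, so a larger weight can be read as that instant contributing more to the prediction. The same weighting pattern could in principle be applied across neighbors instead of time steps to express spatial affinity, although the abstract only states that a deep neural network is used for that part.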
