Social LSTM: Human Trajectory Prediction in Crowded Spaces

Pedestrians follow different trajectories to avoid obstacles and accommodate fellow pedestrians. Any autonomous vehicle navigating such a scene should be able to foresee the future positions of pedestrians and adjust its path accordingly to avoid collisions. This problem of trajectory prediction can be viewed as a sequence generation task, in which we predict people's future trajectories from their past positions. Following the recent success of Recurrent Neural Network (RNN) models on sequence prediction tasks, we propose an LSTM model that learns general human movement and predicts future trajectories. This is in contrast to traditional approaches, which use hand-crafted functions such as social forces. We demonstrate the performance of our method on several public datasets, where it outperforms state-of-the-art methods on some of them. We also analyze the predicted trajectories to demonstrate the motion behaviour the model has learned.
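To make the sequence-generation framing concrete, the following is a minimal sketch, not the model proposed here: a single untrained LSTM cell in plain Python that reads one pedestrian's past positions and rolls out future positions one step at a time. Everything in it is an assumption made for illustration: the function names (`init_params`, `lstm_step`, `predict_trajectory`), the hidden size, the random weights, the use of per-step displacements as input, and the linear output layer. It also treats each pedestrian in isolation, so it says nothing about how interactions between pedestrians might be modelled.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(input_dim=2, hidden_dim=8, seed=0):
    # Hypothetical, untrained parameters; a real model would learn these.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": mat(4 * hidden_dim, input_dim + hidden_dim),  # gates i, f, o, g stacked
        "b": [0.0] * (4 * hidden_dim),
        "W_out": mat(input_dim, hidden_dim),  # hidden state -> (dx, dy)
        "b_out": [0.0] * input_dim,
        "hidden_dim": hidden_dim,
    }


def lstm_step(params, x, h, c):
    # One standard LSTM update: gates computed from [x, h], then new cell and hidden state.
    n = params["hidden_dim"]
    z = [a + b for a, b in zip(matvec(params["W"], x + h), params["b"])]
    i = [sigmoid(v) for v in z[0:n]]            # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
    o = [sigmoid(v) for v in z[2 * n:3 * n]]    # output gate
    g = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate cell values
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new


def predict_trajectory(params, observed, n_future):
    # Encode the observed displacements, then feed each prediction back as the next input.
    n = params["hidden_dim"]
    h, c = [0.0] * n, [0.0] * n
    for prev, cur in zip(observed, observed[1:]):
        step = [cur[0] - prev[0], cur[1] - prev[1]]
        h, c = lstm_step(params, step, h, c)

    pos = list(observed[-1])
    future = []
    for _ in range(n_future):
        step = [a + b for a, b in zip(matvec(params["W_out"], h), params["b_out"])]
        pos = [pos[0] + step[0], pos[1] + step[1]]
        future.append(tuple(pos))
        h, c = lstm_step(params, step, h, c)
    return future


if __name__ == "__main__":
    observed = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2), (1.5, 0.3)]
    for point in predict_trajectory(init_params(), observed, n_future=5):
        print(point)
```

Because the weights are random, the printed positions are not meaningful predictions; the sketch only shows the data flow of observing a position sequence and then generating the continuation step by step.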
