Deep tracking in the wild: End-to-end tracking using recurrent neural networks

This paper presents a novel approach for tracking static and dynamic objects for an autonomous vehicle operating in complex urban environments. Whereas traditional approaches for tracking often feature numerous hand-engineered stages, this method is learned end-to-end and can directly predict a fully unoccluded occupancy grid from raw laser input. We employ a recurrent neural network to capture the state and evolution of the environment, and train the model in an entirely unsupervised manner. In doing so, our use case compares to model-free, multi-object tracking although we do not explicitly perform the underlying data-association process. Further, we demonstrate that the underlying representation learned for the tracking task can be leveraged via inductive transfer to train an object detector in a data efficient manner. We motivate a number of architectural features and show the positive contribution of dilated convolutions, dynamic and static memory units to the task of tracking and classifying complex dynamic scenes through full occlusion. Our experimental results illustrate the ability of the model to track cars, buses, pedestrians, and cyclists from both moving and stationary platforms. Further, we compare and contrast the approach with a more traditional model-free multi-object tracking pipeline, demonstrating that it can more accurately predict future states of objects from current inputs.

[1]  Sebastian Thrun,et al.  Explanation-Based Neural Network Learning for Robot Control , 1992, NIPS.

[2]  Stefan C. Kremer,et al.  Recurrent Neural Networks , 2013, Handbook on Neural Information Processing.

[3]  Wolfram Burgard,et al.  Using Boosted Features for the Detection of People in 2D Range Data , 2007, Proceedings 2007 IEEE International Conference on Robotics and Automation.

[4]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[5]  Viorica Patraucean,et al.  Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.

[6]  Shao-Wen Yang,et al.  Simultaneous egomotion estimation, segmentation, and moving object detection , 2011, J. Field Robotics.

[7]  Ingmar Posner,et al.  Voting for Voting in Online Point Cloud Object Detection , 2015, Robotics: Science and Systems.

[8]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Rich Caruana,et al.  Learning Many Related Tasks at the Same Time with Backpropagation , 1994, NIPS.

[10]  Paul Newman,et al.  Model-free detection and tracking of dynamic objects with 2D lidar , 2015, Int. J. Robotics Res..

[11]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[12]  Dushyant Rao,et al.  Deep Tracking on the Move: Learning to Track the World from a Moving Vehicle using Recurrent Neural Networks , 2016, ArXiv.

[13]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[14]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[15]  Kyungjae Lee,et al.  Robust modeling and prediction in dynamic environments using recurrent flow networks , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[16]  Trung-Dung Vu,et al.  Online Localization and Mapping with Moving Object Tracking in Dynamic Outdoor Environments , 2007, 2007 IEEE Intelligent Vehicles Symposium.

[17]  Ingmar Posner,et al.  End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks , 2016, ArXiv.

[18]  Ingmar Posner,et al.  Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks , 2016, AAAI.

[19]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[20]  Liang Zhao,et al.  Qualitative and quantitative car tracking from a range image sequence , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[21]  Dushyant Rao,et al.  Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[22]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[23]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[24]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[26]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[27]  Tao Wang,et al.  End-to-end text recognition with convolutional neural networks , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[28]  Dieter Fox,et al.  SE3-nets: Learning rigid body motion using deep neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[29]  Gabriel Kreiman,et al.  Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[30]  Sebastian Thrun,et al.  Model based vehicle detection and tracking for autonomous urban driving , 2009, Auton. Robots.