Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems

Predicting the future location of vehicles is essential for safety-critical applications such as advanced driver assistance systems (ADAS) and autonomous driving. This paper introduces a novel approach to simultaneously predict both the location and scale of target vehicles in the first-person (egocentric) view of an ego-vehicle. We present a multi-stream recurrent neural network (RNN) encoder-decoder model that separately captures both object location and scale and pixel-level observations for future vehicle localization. We show that incorporating dense optical flow improves prediction results significantly since it captures information about motion as well as appearance change. We also find that explicitly modeling future motion of the ego-vehicle improves the prediction accuracy, which could be especially beneficial in intelligent and automated vehicles that have motion planning capability. To evaluate the performance of our approach, we present a new dataset of first-person videos collected from a variety of scenarios at road intersections, which are particularly challenging moments for prediction because vehicle trajectories are diverse and dynamic. Code and dataset have been made available at: https://usa.honda-ri.com/hevi

[1]  Sujitha Martin,et al.  Goal-oriented Object Importance Estimation in On-road Driving Videos , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[2]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Stefan Lee,et al.  Embodied Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[5]  Yong Jae Lee,et al.  Predicting Important Objects for Egocentric Video Summarization , 2015, International Journal of Computer Vision.

[6]  Yoichi Sato,et al.  Future Person Localization in First-Person Videos , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Jianbo Shi,et al.  First Person Action-Object Detection with EgoNet , 2016, Robotics: Science and Systems.

[8]  Yoshua Bengio,et al.  Gated Feedback Recurrent Neural Networks , 2015, ICML.

[9]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[10]  Silvio Savarese,et al.  SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[12]  Kris M. Kitani,et al.  Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[14]  James M. Rehg,et al.  Learning to Predict Gaze in Egocentric Video , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[16]  Takahiro Okabe,et al.  Fast unsupervised ego-action learning for first-person sports videos , 2011, CVPR 2011.

[17]  Jianbo Shi,et al.  Egocentric Future Localization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[19]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[20]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[21]  David González,et al.  A Review of Motion Planning Techniques for Automated Vehicles , 2016, IEEE Transactions on Intelligent Transportation Systems.

[22]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Mohan M. Trivedi,et al.  Convolutional Social Pooling for Vehicle Trajectory Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[24]  Silvio Savarese,et al.  Single-source Attention Path Prediction Multi-source Attention Predicted Observed , 2018 .

[25]  Yong Jae Lee,et al.  Identifying First-Person Camera Wearers in Third-Person Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Juan D. Tardós,et al.  ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[27]  Philip H. S. Torr,et al.  DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Ali Borji,et al.  Ego2Top: Matching Viewers in Egocentric and Top-View Videos , 2016, ECCV.

[29]  Jianbo Shi,et al.  Egocentric Basketball Motion Planning from a Single First-Person Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Bernt Schiele,et al.  Long-Term On-board Prediction of People in Traffic Scenes Under Uncertainty , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Silvio Savarese,et al.  Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Yu Yao,et al.  The Smart Black Box: A Value-Driven Automotive Event Data Recorder , 2018, 2018 21st International Conference on Intelligent Transportation Systems (ITSC).

[33]  James M. Rehg,et al.  Delving into egocentric actions , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Deva Ramanan,et al.  Detecting activities of daily living in first-person camera views , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Jianbo Shi,et al.  Predicting Behaviors of Basketball Players from First Person Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[38]  Xin Ye,et al.  Active Object Perceiver: Recognition-Guided Policy Learning for Object Searching on Mobile Robots , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[39]  Ali Farhadi,et al.  Understanding egocentric activities , 2011, 2011 International Conference on Computer Vision.

[40]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[41]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Michael S. Ryoo,et al.  Joint Person Segmentation and Identification in Synchronized First- and Third-person Videos , 2018, ECCV.

[43]  Ulrich Kressel,et al.  Probabilistic trajectory prediction with Gaussian mixture models , 2012, 2012 IEEE Intelligent Vehicles Symposium.

[44]  Larry S. Davis,et al.  Temporal Recurrent Networks for Online Action Detection , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45]  Alexey Dosovitskiy,et al.  End-to-End Driving Via Conditional Imitation Learning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[46]  Mohan M. Trivedi,et al.  How Would Surround Vehicles Move? A Unified Framework for Maneuver Classification and Motion Prediction , 2018, IEEE Transactions on Intelligent Vehicles.