Gated Recurrent Fusion to Learn Driving Behavior from Temporal Multimodal Data

The Tactical Driver Behavior modeling problem requires an understanding of driver actions in complicated urban scenarios from rich multimodal signals including video, LiDAR and CAN signal data streams. However, the majority of deep learning research is focused either on learning the vehicle/environment state (sensor fusion) or the driver policy (from temporal data), but not both. Learning both tasks jointly offers the richest distillation of knowledge but presents challenges in the formulation and successful training. In this work, we propose promising first steps in this direction. Inspired by the gating mechanisms in Long Short-Term Memory units (LSTMs), we propose Gated Recurrent Fusion Units (GRFU) that learn fusion weighting and temporal weighting simultaneously. We demonstrate it's superior performance over multimodal and temporal baselines in supervised regression and classification tasks, all in the realm of autonomous navigation. On tactical driver behavior classification using Honda Driving Dataset (HDD), we report <inline-formula><tex-math notation="LaTeX">$10\%$</tex-math></inline-formula> improvement in mean Average Precision (mAP) score, and similarly, for steering angle regression on TORCS dataset, we note a <inline-formula><tex-math notation="LaTeX">$20\%$</tex-math></inline-formula> drop in Mean Squared Error (MSE) over the state-of-the-art.

[1]  Wolfram Burgard,et al.  Self-Supervised Model Adaptation for Multimodal Semantic Segmentation , 2018, International Journal of Computer Vision.

[2]  Ahmet M. Kondoz,et al.  Fusion of LiDAR and Camera Sensor Data for Environment Sensing in Driverless Vehicles , 2017, ArXiv.

[3]  Daniel Roggen,et al.  Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition , 2016, Sensors.

[4]  Christos Dimitrakakis,et al.  TORCS, The Open Racing Car Simulator , 2005 .

[5]  Shih-Fu Chang,et al.  Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation , 2018, ArXiv.

[6]  Wojciech Zaremba,et al.  An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[7]  David Janz,et al.  Learning to Drive in a Day , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[8]  Andrew Owens,et al.  The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes? , 2017, CoRL.

[9]  Manuela M. Veloso,et al.  Learning End-to-end Multimodal Sensor Policies for Autonomous Navigation , 2017, CoRL.

[10]  Ignazio Gallo,et al.  Image and Encoded Text Fusion for Multi-Modal Classification , 2018, 2018 Digital Image Computing: Techniques and Applications (DICTA).

[11]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[12]  Hema Swetha Koppula,et al.  Recurrent Neural Networks for driver activity anticipation via sensory-fusion architecture , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[13]  Jiebo Luo,et al.  Deep Multimodal Representation Learning from Temporal Data , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Xuelong Li,et al.  Temporal Multimodal Learning in Audiovisual Speech Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Cewu Lu,et al.  LiDAR-Video Driving Dataset: Learning Driving Policies Effectively , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Matthew Johnson-Roberson,et al.  Driving in the Matrix: Can virtual worlds replace human-generated annotations for real world tasks? , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[17]  Dean Pomerleau,et al.  ALVINN, an autonomous land vehicle in a neural network , 2015 .

[18]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[19]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[20]  Vladlen Koltun,et al.  An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling , 2018, ArXiv.

[21]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[22]  Yann LeCun,et al.  Off-Road Obstacle Avoidance through End-to-End Learning , 2005, NIPS.

[23]  Qingquan Li,et al.  A Sensor-Fusion Drivable-Region and Lane-Detection System for Autonomous Vehicle Navigation in Challenging Road Scenarios , 2014, IEEE Transactions on Vehicular Technology.

[24]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[25]  Moshe Kam,et al.  Sensor Fusion for Mobile Robot Navigation , 1997, Proc. IEEE.

[26]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Zhen Li,et al.  LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling , 2016, ECCV.

[28]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[29]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[30]  Chuan Wang,et al.  Look, Listen and Learn - A Multimodal LSTM for Speaker Identification , 2016, AAAI.

[31]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[32]  Bin Yang,et al.  Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Wolfram Burgard,et al.  Multimodal interaction-aware motion prediction for autonomous street crossing , 2018, Int. J. Robotics Res..

[34]  Simon Lucey,et al.  Argoverse: 3D Tracking and Forecasting With Rich Maps , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Jürgen Schmidhuber,et al.  LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[36]  Jianxiong Xiao,et al.  DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[38]  Li Fei-Fei,et al.  Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Daniele Loiacono,et al.  Learning drivers for TORCS through imitation using supervised methods , 2009, 2009 IEEE Symposium on Computational Intelligence and Games.

[40]  Wolfram Burgard,et al.  VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry , 2018, IEEE Robotics and Automation Letters.

[41]  Ralph Helmar Rasshofer,et al.  Influences of weather phenomena on automotive laser radar systems , 2011 .

[42]  Jianqiang Wang,et al.  Object Classification Using CNN-Based Fusion of Vision and LIDAR in Autonomous Vehicle Environment , 2018, IEEE Transactions on Industrial Informatics.

[43]  Kate Saenko,et al.  Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Yang Gao,et al.  End-to-End Learning of Driving Models from Large-Scale Video Datasets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Roland Göcke,et al.  Extending Long Short-Term Memory for Multi-View Structured Learning , 2016, ECCV.

[46]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  John R. Hershey,et al.  Attention-Based Multimodal Fusion for Video Description , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Xin Zhang,et al.  End to End Learning for Self-Driving Cars , 2016, ArXiv.

[50]  Jerzy Z. Sasiadek,et al.  Sensor fusion based on fuzzy Kalman filtering for autonomous robot vehicle , 1999, Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C).