Recurrent Neural Networks for driver activity anticipation via sensory-fusion architecture

Anticipating the future actions of a human is a widely studied problem in robotics that requires spatio-temporal reasoning. In this work we propose a deep learning approach for anticipation in sensory-rich robotics applications. We introduce a sensory-fusion architecture which jointly learns to anticipate and fuse information from multiple sensory streams. Our architecture consists of Recurrent Neural Networks (RNNs) that use Long Short-Term Memory (LSTM) units to capture long temporal dependencies. We train our architecture in a sequence-to-sequence prediction manner, and it explicitly learns to predict the future given only a partial temporal context. We further introduce a novel loss layer for anticipation which prevents over-fitting and encourages early anticipation. We use our architecture to anticipate driving maneuvers several seconds before they happen on a natural driving data set of 1180 miles. The context for maneuver anticipation comes from multiple sensors installed on the vehicle. Our approach shows significant improvement over the state-of-the-art in maneuver anticipation by increasing the precision from 77.4% to 90.5% and recall from 71.2% to 87.4%.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Ashutosh Saxena,et al.  Robobarista: Object Part Based Transfer of Manipulation Trajectories from Crowd-Sourcing in 3D Pointclouds , 2015, ISRR.

[3]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[4]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[5]  Yoshua Bengio,et al.  Equilibrated adaptive learning rates for non-convex optimization , 2015, NIPS.

[6]  Hema Swetha Koppula,et al.  Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[7]  Wolfram Burgard,et al.  Learning Motion Patterns of People for Compliant Robot Motion , 2005, Int. J. Robotics Res..

[8]  Ruzena Bajcsy,et al.  Safe semi-autonomous control with enhanced driver modeling , 2012, 2012 American Control Conference (ACC).

[9]  Harm de Vries,et al.  RMSProp and equilibrated adaptive learning rates for non-convex optimization. , 2015 .

[10]  Ruzena Bajcsy,et al.  Improved driver modeling for human-in-the-loop vehicular control , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[11]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[12]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[13]  Mohan M. Trivedi,et al.  Looking-in and looking-out vision for Urban Intelligent Assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking , 2014, 2014 IEEE Intelligent Vehicles Symposium Proceedings.

[14]  Hema Swetha Koppula,et al.  Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Yoshua Bengio,et al.  An Input Output HMM Architecture , 1994, NIPS.

[16]  Dmitry Berenson,et al.  Human-robot collaborative manipulation planning using early prediction of human motion , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[17]  Pablo Lardelli-Claret,et al.  The influence of passengers on the risk of the driver causing a car collision in Spain. Analysis of collisions from 1990 to 1999. , 2004, Accident; analysis and prevention.

[18]  Ashutosh Saxena,et al.  Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions , 2014, Int. J. Robotics Res..

[19]  Bernhard Schölkopf,et al.  Probabilistic movement modeling for intention inference in human–robot interaction , 2013, Int. J. Robotics Res..

[20]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[23]  Peter Robinson,et al.  Constrained Local Neural Fields for Robust Facial Landmark Detection in the Wild , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[24]  Luke Fletcher,et al.  Correlating driver gaze with the road scene for driver assistance systems , 2005, Robotics Auton. Syst..

[25]  Ruzena Bajcsy,et al.  Semiautonomous Vehicular Control Using Driver Modeling , 2014, IEEE Transactions on Intelligent Transportation Systems.

[26]  Siddhartha S. Srinivasa,et al.  Formalizing Assistive Teleoperation , 2012, Robotics: Science and Systems.

[27]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Wolfram Burgard,et al.  Feature-Based Prediction of Trajectories for Socially Compliant Navigation , 2012, Robotics: Science and Systems.

[29]  Yoshua Bengio,et al.  On the Expressive Power of Deep Architectures , 2011, ALT.

[30]  Carlo Tomasi,et al.  Good features to track , 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Kevin Lee,et al.  Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions , 2014, Int. J. Robotics Res..

[32]  Anup Doshi,et al.  Lane change intent prediction for driver assistance: On-road design and evaluation , 2011, 2011 IEEE Intelligent Vehicles Symposium (IV).

[33]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Lars Petersson,et al.  Vision in and out of Vehicles , 2003, IEEE Intell. Syst..

[35]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[36]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Siddhartha S. Srinivasa,et al.  Planning-based prediction for pedestrians , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[38]  Maria E. Jabon,et al.  Facial expression analysis for predicting unsafe driving behavior , 2011, IEEE Pervasive Computing.

[39]  Mohan M. Trivedi,et al.  On-road prediction of driver's intent with multimodal sensory cues , 2011, IEEE Pervasive Computing.

[40]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[41]  Tarak Gandhi,et al.  Looking-In and Looking-Out of a Vehicle: Computer-Vision-Based Enhanced Vehicle Safety , 2007, IEEE Transactions on Intelligent Transportation Systems.

[42]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.