Interaction Graphs for Object Importance Estimation in On-road Driving Videos

A vehicle driving along the road is surrounded by many objects, but only a small subset of them influence the driver's decisions and actions. Learning to estimate how important each object is to the driver's real-time decision-making may help us better understand human driving behavior and lead to more reliable autonomous driving systems. Solving this problem requires models that understand the interactions between the ego-vehicle and the surrounding objects. However, interactions among the other objects in the scene can also be highly informative: for example, a pedestrian beginning to cross the road between the ego-vehicle and the car in front makes the car in front less important. We propose a novel framework for object importance estimation based on an interaction graph, in which the features of each object node are updated by interacting with other nodes through graph convolution. Experiments show that our model outperforms state-of-the-art baselines while requiring much less input and pre-processing.
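
To make the graph-convolution update concrete, the following PyTorch sketch shows one plausible form of an interaction graph over detected objects, where each node's features are refined by aggregating from its neighbors and then mapped to a per-object importance score. The feature dimensions, adjacency construction, and scoring head are illustrative assumptions and not the paper's actual implementation.

    # Minimal sketch of an interaction-graph update for object importance estimation.
    # The exact architecture is not specified in the abstract; node features, the
    # adjacency matrix, and the scoring head below are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class InteractionGraphLayer(nn.Module):
        """One graph-convolution step: each object node aggregates features
        from the other nodes it interacts with (including, e.g., an ego-vehicle node)."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x:   (N, in_dim)  per-object appearance/position features
            # adj: (N, N)       pairwise interaction weights (assumed row-normalized)
            return F.relu(self.linear(adj @ x))


    class ImportanceEstimator(nn.Module):
        """Stacked interaction-graph layers followed by a per-node importance score."""

        def __init__(self, in_dim: int = 256, hidden_dim: int = 128):
            super().__init__()
            self.gcn1 = InteractionGraphLayer(in_dim, hidden_dim)
            self.gcn2 = InteractionGraphLayer(hidden_dim, hidden_dim)
            self.score = nn.Linear(hidden_dim, 1)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            h = self.gcn2(self.gcn1(x, adj), adj)
            return torch.sigmoid(self.score(h)).squeeze(-1)  # (N,) importance in [0, 1]


    if __name__ == "__main__":
        n_objects, feat_dim = 5, 256
        feats = torch.randn(n_objects, feat_dim)                       # toy object features
        adj = torch.softmax(torch.randn(n_objects, n_objects), dim=-1) # toy interaction graph
        print(ImportanceEstimator()(feats, adj))                       # one score per object

In practice the node features would come from an object detector and the adjacency weights could encode spatial or learned relations; the sketch only illustrates how object features can be updated through interaction before scoring.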
