Dynamic Traffic Scene Classification with Space-Time Coherence

This paper examines the problem of dynamic traffic scene classification under space-time variations in viewpoint that arise from video captured on-board a moving vehicle. Solutions to this problem are important for realization of effective driving assistance technologies required to interpret or predict road user behavior. Currently, dynamic traffic scene classification has not been adequately addressed due to a lack of benchmark datasets that consider spatiotemporal evolution of traffic scenes resulting from a vehicle’s ego-motion. This paper has three main contributions. First, an annotated dataset is released to enable dynamic scene classification that includes 80 hours of diverse high quality driving video data clips collected in the San Francisco Bay area. The dataset includes temporal annotations for road places, road types, weather, and road surface conditions. Second, we introduce novel and baseline algorithms that utilize semantic context and temporal nature of the dataset for dynamic classification of road scenes. Finally, we showcase algorithms and experimental results that highlight how extracted features from scene classification serve as strong priors and help with tactical driver behavior understanding. The results show significant improvement from previously reported driving behavior detection baselines in the literature.

[1]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  Angelos Amditis,et al.  Dynamic road scene classification: Combining motion with a visual vocabulary model , 2013, Proceedings of the 16th International Conference on Information Fusion.

[3]  Ruigang Yang,et al.  The ApolloScape Dataset for Autonomous Driving , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[4]  Cewu Lu,et al.  LiDAR-Video Driving Dataset: Learning Driving Policies Effectively , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Rahul Sukthankar,et al.  Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Sinisa Segvic,et al.  Image representations on a budget: Traffic scene classification in a restricted bandwidth scenario , 2014, 2014 IEEE Intelligent Vehicles Symposium Proceedings.

[9]  Michael Milford,et al.  Don't Look Back: Robustifying Place Categorization for Viewpoint- and Condition-Invariant Place Recognition , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[10]  Bolei Zhou,et al.  Places: A 10 Million Image Database for Scene Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Yang Gao,et al.  End-to-End Learning of Driving Models from Large-Scale Video Datasets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Paul Newman,et al.  1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[13]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Yinda Zhang,et al.  LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop , 2015, ArXiv.

[16]  Hema Swetha Koppula,et al.  Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Kate Saenko,et al.  Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Andrea Palazzi,et al.  Predicting the Driver's Focus of Attention: The DR(eye)VE Project , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Peter Kontschieder,et al.  The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Jeremy S. Smith,et al.  Traffic scene recognition based on deep CNN and VLAD spatial pyramids , 2017, 2017 International Conference on Machine Learning and Cybernetics (ICMLC).

[21]  Sanja Fidler,et al.  Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Patrik Jonsson Road condition discrimination using weather data and camera images , 2011, 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC).

[23]  Zhe L. Lin,et al.  Top-Down Neural Attention by Excitation Backprop , 2016, International Journal of Computer Vision.

[24]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Trevor Darrell,et al.  The BDD-Nexar Collective : A Large-Scale , Crowsourced , Dataset of Driving Scenes Vashisht Madhavan , 2017 .

[26]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Rama Chellappa,et al.  Moving vistas: Exploiting motion for describing scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[28]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Cewu Lu,et al.  RSCM: Region Selection and Concurrency Model for Multi-Class Weather Recognition , 2017, IEEE Transactions on Image Processing.

[31]  Xuelong Li,et al.  A Multi-Task Framework for Weather Recognition , 2017, ACM Multimedia.

[32]  Hennie Brugman,et al.  Annotating Multi-media/Multi-modal Resources with ELAN , 2004, LREC.

[33]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Patrik Jonsson,et al.  Road Surface Status Classification Using Spectral Analysis of NIR Camera Images , 2015, IEEE Sensors Journal.

[35]  Richard P. Wildes,et al.  Dynamic Scene Recognition with Complementary Spatiotemporal Features , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Limin Wang,et al.  Knowledge Guided Disambiguation for Large-Scale Scene Classification With Multi-Resolution CNNs , 2016, IEEE Transactions on Image Processing.