An Image-based Approach of Task-driven Driving Scene Categorization

Categorizing driving scenes via visual perception is a key technology for safe driving and for the downstream tasks of autonomous vehicles. Traditional methods infer the scene category by detecting scene-related objects or by using a classifier trained on large datasets of finely labeled scene images. However, in cluttered dynamic scenes such as a campus or a park, human activities are not strongly confined by rules, and the functional attributes of places are not strongly correlated with objects. How to define, model, and infer scene categories is therefore crucial if the technique is to be genuinely helpful in assisting a robot to pass through the scene. This paper proposes a method of task-driven driving scene categorization using weakly supervised data. Given a front-view video of a driving scene, a set of anchor points is marked by following the decision making of a human driver; an anchor point is not a semantic label but an indicator that the semantic attribute of the scene differs from that of the previous one. A measure is learned via contrastive learning to discriminate scenes of different semantic attributes, and a driving scene profiling and categorization method is developed based on that measure. Experiments are conducted on a front-view video recorded while a vehicle passed through the cluttered dynamic campus of Peking University. The scenes are categorized into straight road, turn road, and alerting traffic. The results of semantic scene similarity learning and driving scene categorization are extensively studied, and the positive result of scene categorization is 97.17% on the learning video and 85.44% on the video of new scenes.
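As a rough illustration of how anchor points could supply weak supervision for contrastive learning, the sketch below splits a video into segments at the anchor frames, treats frames within one segment as similar pairs and frames from adjacent segments as dissimilar pairs, and scores a pair with a margin-based contrastive loss. The abstract does not specify the pair sampling or the loss, so both are assumptions; the function names (`segment_ids`, `make_pairs`, `contrastive_loss`) and the `margin` parameter are hypothetical.

```python
import math
from itertools import combinations

def segment_ids(num_frames, anchors):
    """Assign each frame a segment index; a new segment starts at each anchor frame."""
    anchor_set = set(anchors)
    ids, current = [], 0
    for i in range(num_frames):
        if i in anchor_set and i > 0:
            current += 1
        ids.append(current)
    return ids

def make_pairs(ids):
    """Build (i, j, same) frame pairs from segment indices.

    Assumption: frames in the same segment share a semantic attribute (same=1),
    and frames in adjacent segments differ (same=0), since an anchor only says
    the scene differs from the previous one. Non-adjacent segments are skipped
    because their relation is unknown.
    """
    pairs = []
    for i, j in combinations(range(len(ids)), 2):
        if ids[i] == ids[j]:
            pairs.append((i, j, 1))
        elif abs(ids[i] - ids[j]) == 1:
            pairs.append((i, j, 0))
    return pairs

def contrastive_loss(a, b, same, margin=1.0):
    """Margin-based pairwise contrastive loss on two embedding vectors (assumed form)."""
    d = math.dist(a, b)
    return d ** 2 if same else max(0.0, margin - d) ** 2

# Toy usage: a 6-frame video with one anchor at frame 3.
ids = segment_ids(6, [3])
pairs = make_pairs(ids)
```

In a real pipeline the embeddings would come from an image encoder trained to minimize this loss over the sampled pairs; here the loss is shown on plain coordinate tuples only to keep the sketch self-contained.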
