Self-Supervised Visual Terrain Classification From Unsupervised Acoustic Feature Learning

Mobile robots operating in unknown urban environments encounter a wide range of complex terrains to which they must adapt their planned trajectory for safe and efficient navigation. Most existing approaches use supervised learning to classify terrains from either an exteroceptive or a proprioceptive sensor modality. However, this requires a tremendous amount of manual labeling effort for each newly encountered terrain, as well as for variations of terrains caused by changing environmental conditions. In this work, we propose a novel terrain classification framework in which an unsupervised proprioceptive classifier, learning from vehicle-terrain interaction sounds, self-supervises an exteroceptive classifier for pixel-wise semantic segmentation of images. To this end, we first learn a discriminative embedding space for vehicle-terrain interaction sounds from triplets of audio clips formed using visual features of the corresponding terrain patches, and then cluster the resulting embeddings. We subsequently use these clusters to label the visual terrain patches by projecting the traversed tracks of the robot into the camera images. Finally, we use the sparsely labeled images to train our semantic segmentation network in a weakly supervised manner. We present extensive quantitative and qualitative results demonstrating that our proprioceptive terrain classifier outperforms state-of-the-art unsupervised methods and that our self-supervised exteroceptive semantic segmentation model achieves performance comparable to supervised learning with manually labeled data.
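
The triplet-based embedding step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `form_triplets` and `triplet_loss`, the nearest/farthest selection rule, the squared Euclidean distance, and the margin value are all assumptions chosen for illustration. The sketch selects, for each audio clip, a positive and a negative clip according to the distance between the visual features of their terrain patches, and then evaluates a standard margin-based triplet loss on the audio embeddings.

```python
import numpy as np


def form_triplets(visual_feats):
    """Build (anchor, positive, negative) index triplets from visual features.

    Assumed rule (hypothetical): the positive is the clip whose terrain patch
    is visually nearest to the anchor's patch, the negative the farthest one.
    """
    diffs = visual_feats[:, None, :] - visual_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # pairwise visual distances
    nearest = dists.copy()
    np.fill_diagonal(nearest, np.inf)        # an anchor cannot be its own positive
    positives = nearest.argmin(axis=1)
    negatives = dists.argmax(axis=1)
    anchors = np.arange(len(visual_feats))
    return np.stack([anchors, positives, negatives], axis=1)


def triplet_loss(audio_emb, triplets, margin=1.0):
    """Mean margin-based triplet loss over squared Euclidean distances."""
    a = audio_emb[triplets[:, 0]]
    p = audio_emb[triplets[:, 1]]
    n = audio_emb[triplets[:, 2]]
    d_ap = np.sum((a - p) ** 2, axis=1)      # anchor-positive distance
    d_an = np.sum((a - n) ** 2, axis=1)      # anchor-negative distance
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))


# Toy data: two visually distinct terrain types with two patches each.
visual = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
triplets = form_triplets(visual)

# Audio embeddings that agree with the visual grouping incur no loss,
# while embeddings that mix the two terrain types are penalized.
aligned_loss = triplet_loss(visual, triplets)
mixed_loss = triplet_loss(visual[[0, 2, 1, 3]], triplets)
```

In a full pipeline the audio embeddings would come from a trainable network and the loss would be minimized by gradient descent before clustering; here the embeddings are fixed arrays so that the loss computation can be checked by hand.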
