Three Ways to Improve Semantic Segmentation with Self-Supervised Depth Estimation

Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised semantic segmentation, which is enhanced by self-supervised monocular depth estimation from unlabeled images. In particular, we propose three key contributions: (1) We transfer knowledge from features learned during self-supervised depth estimation to semantic segmentation, (2) we implement a strong data augmentation by blending images and labels using the structure of the scene, and (3) we utilize the depth feature diversity as well as the level of difficulty of learning depth in a student-teacher framework to select the most useful samples to be annotated for semantic segmentation. We validate the proposed model on the Cityscapes dataset, where all three modules demonstrate significant performance gains, and we achieve state-of-the-art results for semi-supervised semantic segmentation. The implementation is available at https://github.com/lhoyer/ improving_segmentation_with_selfsupervised_

[1]  Gregory Shakhnarovich,et al.  Self-Supervised Relative Depth Learning for Urban Scene Understanding , 2017, ECCV.

[2]  Timo Aila,et al.  Consistency regularization and CutMix for semi-supervised semantic segmentation , 2019, ArXiv.

[3]  Lennart Svensson,et al.  ClassMix: Segmentation-Based Data Augmentation for Semi-Supervised Learning , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[4]  Gregory Shakhnarovich,et al.  Colorization as a Proxy Task for Visual Understanding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Concetto Spampinato,et al.  Semi Supervised Semantic Segmentation Using Generative Adversarial Network , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  David Berthelot,et al.  MixMatch: A Holistic Approach to Semi-Supervised Learning , 2019, NeurIPS.

[9]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[10]  Yoshua Bengio,et al.  Interpolation Consistency Training for Semi-Supervised Learning , 2019, IJCAI.

[11]  Laurent Zwald,et al.  The BerHu penalty and the grouped effect , 2012, 1207.6868.

[12]  Tim Fingscheidt,et al.  Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance , 2020, ECCV.

[13]  Alexander H. Liu,et al.  Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Danny Z. Chen,et al.  Biomedical Image Segmentation via Representative Annotation , 2019, AAAI.

[15]  Yao Hu,et al.  Active learning via neighborhood reconstruction , 2013, IJCAI 2013.

[16]  Tae-Hyun Oh,et al.  Visuomotor Understanding for Representation Learning of Driving Scenes , 2019, BMVC.

[17]  Carsten Rother,et al.  CEREALS - Cost-Effective REgion-based Active Learning for Semantic Segmentation , 2018, BMVC.

[18]  Luc Van Gool,et al.  Revisiting Multi-Task Learning in the Deep Learning Era , 2020, ArXiv.

[19]  Zoubin Ghahramani,et al.  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , 2015, ICML.

[20]  Zunlei Feng,et al.  DEAL: Difficulty-aware Active Learning for Semantic Segmentation , 2020, ACCV.

[21]  Lei Shi,et al.  Diversifying Convex Transductive Experimental Design for Active Learning , 2016, IJCAI.

[22]  Trevor Darrell,et al.  Variational Adversarial Active Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Rares Ambrus,et al.  Semantically-Guided Representation Learning for Self-Supervised Monocular Depth , 2020, ICLR.

[24]  Andreas Bär,et al.  Improved Noise and Attack Robustness for Semantic Segmentation by Using Multi-Task Training with Self-Supervised Depth Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[25]  Mark Craven,et al.  An Analysis of Active Learning Strategies for Sequence Labeling Tasks , 2008, EMNLP.

[26]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[28]  Ming-Hsuan Yang,et al.  Adversarial Learning for Semi-supervised Semantic Segmentation , 2018, BMVC.

[29]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jan Kautz,et al.  SENSE: A Shared Encoder Network for Scene-Flow Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Feiping Nie,et al.  Early Active Learning via Robust Representation and Structured Sparsity , 2013, IJCAI.

[32]  Julien Valentin,et al.  ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Jinbo Bi,et al.  Active learning via transductive experimental design , 2006, ICML.

[34]  Dong-Hyun Lee,et al.  Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks , 2013 .

[35]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Jia-Bin Huang,et al.  DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[37]  Seong Joon Oh,et al.  CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  C. Hudelot,et al.  Semi-Supervised Semantic Segmentation With Cross-Consistency Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jianping Shi,et al.  Semi-Supervised Semantic Segmentation via Dynamic Self-Training and Class-Balanced Curriculum , 2020, ArXiv.

[40]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[41]  Jelena Novosel,et al.  Boosting semantic segmentation with multi-task self-supervised learning for autonomous driving applications , 2019 .

[42]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[43]  Thomas Brox,et al.  Semi-Supervised Semantic Segmentation With High- and Low-Level Consistency , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Qingshan Liu,et al.  Joint Active Learning with Feature Selection via CUR Matrix Decomposition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[46]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[47]  C. V. Jawahar,et al.  Region-based active learning for efficient labeling in semantic segmentation , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[48]  Changsheng Li,et al.  On Deep Unsupervised Active Learning , 2020, IJCAI.

[49]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Yu Zhang,et al.  A Survey on Multi-Task Learning , 2017, IEEE Transactions on Knowledge and Data Engineering.

[51]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[52]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[53]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[55]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[56]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Chun Chen,et al.  Active Learning Based on Locally Linear Reconstruction , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Roberto Cipolla,et al.  Semantic object classes in video: A high-definition ground truth database , 2009, Pattern Recognit. Lett..

[59]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[60]  Luigi di Stefano,et al.  Geometry meets semantics for semi-supervised monocular depth estimation , 2018, ACCV.

[61]  H. Sebastian Seung,et al.  Query by committee , 1992, COLT '92.

[62]  Nicu Sebe,et al.  PAD-Net: Multi-tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[63]  Luc Van Gool,et al.  Self-supervised Object Motion and Depth Estimation from Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[64]  Lin Yang,et al.  Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation , 2017, MICCAI.

[65]  Luc Van Gool,et al.  Multi-Task Learning for Dense Prediction Tasks: A Survey. , 2020, IEEE transactions on pattern analysis and machine intelligence.

[66]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[67]  Rebecca Hwa,et al.  Sample Selection for Statistical Parsing , 2004, CL.

[68]  Anelia Angelova,et al.  Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[69]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Timo Aila,et al.  Semi-supervised semantic segmentation needs strong, varied perturbations , 2019, BMVC.

[71]  Andrew McCallum,et al.  Employing EM and Pool-Based Active Learning for Text Classification , 1998, ICML.

[72]  Rynson W. H. Lau,et al.  Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss , 2018, ECCV.

[73]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[74]  Cordelia Schmid,et al.  Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[75]  Xavier Giró-i-Nieto,et al.  Cost-Effective Active Learning for Melanoma Segmentation , 2017, NIPS 2017.

[76]  Charless C. Fowlkes,et al.  Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation , 2016, ECCV.

[77]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[78]  Silvio Savarese,et al.  Active Learning for Convolutional Neural Networks: A Core-Set Approach , 2017, ICLR.

[79]  Arnold W. M. Smeulders,et al.  Active learning using pre-clustering , 2004, ICML.