Pointly-supervised scene parsing with uncertainty mixture

Abstract: Pointly-supervised learning is an important topic for scene parsing, as dense annotation is extremely expensive and hard to scale. The state-of-the-art method harvests pseudo labels by applying thresholds to softmax outputs. There are two issues with this practice: (1) the softmax output does not necessarily reflect the confidence of the network's prediction, and (2) there is no principled way to choose the optimal threshold, and tuning thresholds can be time-consuming for deep neural networks. Our method, by contrast, builds upon uncertainty measures instead of softmax outputs and is free of threshold tuning. We motivate the method with a large-scale analysis of the distribution of uncertainty measures, using strong models and challenging datasets. This analysis leads to the discovery of a statistical phenomenon that we call uncertainty mixture. Specifically, for each category taken independently, the distribution of uncertainty measures for unlabeled points is a mixture of two components (certain vs. uncertain samples). The phenomenon of uncertainty mixture is surprisingly ubiquitous in real-world datasets such as PascalContext and ADE20k. Inspired by this discovery, we propose to decompose the distribution of uncertainty measures with a Gamma mixture model, leading to a principled method for harvesting reliable pseudo labels. Beyond that, we assume that the uncertainty measures for labeled points are always drawn from the certain component, which amounts to a regularized Gamma mixture model. We provide a thorough theoretical analysis of this model, showing that it can be solved with an EM-style algorithm with a convergence guarantee. Our method is also empirically successful. On PascalContext and ADE20k, we achieve clear margins over the baseline, notably with no threshold tuning in the pseudo label generation procedure. In absolute terms, since our method combines well with strong baselines, we reach new state-of-the-art performance on both datasets.
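To make the idea of a regularized two-component Gamma mixture concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes the uncertainty measures for one category are given as 1-D arrays of positive numbers. It pins labeled points to the certain component by fixing their responsibility to 1, which is only one possible reading of the constraint described above. It also replaces the exact M-step with weighted moment matching, so the convergence guarantee mentioned in the abstract does not carry over to this sketch. All function names are hypothetical.

```python
import math
import numpy as np

EPS = 1e-9  # keeps weighted sums away from zero

def gamma_logpdf(x, shape, scale):
    """Log-density of Gamma(shape, scale) at positive x."""
    x = np.maximum(np.asarray(x, dtype=float), 1e-12)
    return ((shape - 1.0) * np.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))

def moment_match(x, w):
    """Weighted moment matching (an approximation, not the exact MLE)."""
    w = w + EPS
    mean = np.sum(w * x) / np.sum(w)
    var = max(np.sum(w * (x - mean) ** 2) / np.sum(w), 1e-12)
    return mean * mean / var, var / mean  # (shape, scale)

def posterior_certain(x, params):
    """Posterior probability that each x belongs to the certain component."""
    pi, (k_c, s_c), (k_u, s_u) = params
    log_c = math.log(pi) + gamma_logpdf(x, k_c, s_c)
    log_u = math.log(1.0 - pi) + gamma_logpdf(x, k_u, s_u)
    return 1.0 / (1.0 + np.exp(np.clip(log_u - log_c, -700.0, 700.0)))

def fit_regularized_gamma_mixture(unlabeled, labeled, n_iter=100):
    """EM-style loop; labeled points are assumed to always be 'certain'."""
    unlabeled = np.asarray(unlabeled, dtype=float)
    labeled = np.asarray(labeled, dtype=float)
    x = np.concatenate([unlabeled, labeled])
    n_u = len(unlabeled)
    # Responsibility of the certain component; labeled entries stay at 1.
    resp = np.ones(len(x))
    # Initial guess: unlabeled points below the median start as certain.
    resp[:n_u] = unlabeled <= np.median(unlabeled)
    for _ in range(n_iter):
        # M-step (approximate): mixing weight from unlabeled points only.
        pi = float(np.clip(resp[:n_u].mean(), 1e-6, 1.0 - 1e-6))
        params = (pi, moment_match(x, resp), moment_match(x, 1.0 - resp))
        # E-step: update responsibilities of unlabeled points only.
        resp[:n_u] = posterior_certain(unlabeled, params)
    return params
```

After fitting, one could keep as pseudo labels those unlabeled points that `posterior_certain` assigns to the certain component as the more probable one. That selection rule is an assumption made for illustration, not a detail stated in the abstract.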
