Dear editor,

Semantic segmentation aims to assign category information to every pixel of an image and plays a vital role in image understanding. In the past few years, deep convolutional neural networks (CNNs) have achieved great success in a large variety of computer vision tasks. Inspired by the advances of CNNs in recognition, the fully convolutional network (FCN) was developed to perform semantic segmentation in an end-to-end, pixel-to-pixel training manner. Owing to its computational efficiency for dense prediction and its end-to-end learning scheme, numerous variants of FCN have since been proposed to boost segmentation performance. The excellent performance of deep models, however, relies heavily on expensive and laborious label annotation of massive images [1]. In practice, most existing deep learning models for semantic segmentation are first pre-trained on millions of images with sample-level annotations, e.g., ImageNet, and then fine-tuned with thousands of pixel-wise annotated images [1]. Three issues remain for semantic segmentation.

• The annotation for semantic segmentation has to be conducted pixel by pixel, which is labor intensive.
• There are inexhaustible unlabeled or partially labeled images in the wild. Recent advances in object detection [2] and image classification [3] show that large-scale unlabeled data can be exploited to boost model performance.
• Almost all existing benchmarks ignore the differences among images and only provide pixel-level annotations, even though a large number of images contribute little to the learning of segmentation models.

To reduce the labor cost of image annotation, several interactive segmentation models and tools have been developed that rely on weakly supervised information, e.g., click points, lines, curves, or bounding boxes [4]. Nevertheless, these studies target interactive annotation of a single image rather than annotating images in batches. To exploit informative images in the wild, researchers have introduced active learning [5], semi-supervised learning [6], uncertainty learning [7], incremental learning [8], context learning [9], and self-supervised learning [1] for model enhancement. In summary, we ask whether we can annotate unlabeled images with the least human labor and train a state-of-the-art segmentation model using the least data.

In this study, we propose a human-in-the-loop segmentation (HISE) framework, which is combined with a classic semantic segmentation model, i.e., FCN. We conduct experiments on seven benchmark datasets, DAVIS2016, MSRA-B, MSRA10K, ECSSD, DUT-OMRON, HKU-IS, and JUDD, to verify the effectiveness of the proposed HISE framework. Experimental results show that HISE can achieve comparable performance with much fewer human annotations and output a seg-
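The letter describes HISE only at a high level, so the snippet below is a purely illustrative sketch, not the authors' method: it shows how an uncertainty-driven, human-in-the-loop selection step could look, where the segmentation model's per-pixel class probabilities are reduced to an entropy score per image and the most uncertain images are forwarded to a human annotator. The entropy criterion, function names, and toy data shapes are all assumptions introduced for illustration.

```python
import numpy as np

def pixel_entropy(prob_map):
    # prob_map: (num_classes, H, W) softmax probabilities from a segmentation model.
    eps = 1e-12
    return -(prob_map * np.log(prob_map + eps)).sum(axis=0)  # (H, W) per-pixel entropy

def select_images_to_annotate(prob_maps, budget):
    # Rank unlabeled images by mean pixel entropy (a simple uncertainty proxy)
    # and return the indices of the `budget` most informative ones,
    # which would then be passed to a human annotator before retraining.
    scores = [pixel_entropy(p).mean() for p in prob_maps]
    return np.argsort(scores)[::-1][:budget]

# Toy usage: 5 unlabeled images, 3 classes, 4x4 resolution; request labels for the 2 most uncertain.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3, 4, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(select_images_to_annotate(list(probs), budget=2))
```

In a full pipeline of this kind, the selected images would be annotated, added to the labeled pool, and the FCN fine-tuned before the next selection round; the specific selection criterion and stopping rule used by HISE are not specified in the letter.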
[1] Kristen Grauman, et al. Active Image Segmentation Propagation. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Xin Liu, et al. Noisy Face Image Sets Refining Collaborated with Discriminant Feature Space Learning. 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017.
[3] Sanja Fidler, et al. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[4] Kang Chen, et al. Uncertainty-optimized deep learning model for small-scale person re-identification. Science China Information Sciences, 2019.
[5] Lei Zhang, et al. Cost-Effective Object Detection: Active Sample Mining With Switchable Selection Criteria. IEEE Transactions on Neural Networks and Learning Systems, 2018.
[6] Xiaoou Tang, et al. Mix-and-Match Tuning for Self-Supervised Semantic Segmentation. AAAI, 2017.
[7] Ruimao Zhang, et al. Cost-Effective Active Learning for Deep Image Classification. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[8] George Papandreou, et al. Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation. 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[9] Song Bai, et al. Feature context learning for human parsing. Science China Information Sciences, 2019.