Joint COCO and Mapillary Workshop at ICCV 2019: LVIS Challenge Track
Technical Report: Classification Calibration for Long-tail Instance Segmentation
29 Oct 2019

This report presents our winning solution to the LVIS 2019 challenge. Remarkable progress has been made in object instance detection and segmentation in recent years. However, existing state-of-the-art methods are mostly evaluated on fairly balanced, class-limited benchmarks such as the Microsoft COCO dataset [8]. In this report, we investigate the performance drop of state-of-the-art two-stage instance segmentation models when trained on extreme long-tail data, using the LVIS [5] dataset, and find that a major cause is the inaccurate classification of object proposals. Based on this observation, we propose to calibrate the predictions of the classification head to improve recognition performance on the tail classes. Without much additional cost or modification of the detection model architecture, our calibration method improves the baseline performance by a large margin on the tail classes. Code will be made available. Importantly, after the submission, we found that significant further improvement can be achieved by modifying the calibration head, which we will update later.

Footnote 1: Both authors marked with * contributed equally to this work.

1. Experimental Details

Dataset statistics. Different from [5], we divide all the 1,230 categories of the LVIS v0.5 dataset into 4 sets, which respectively contain < 10, 10-100, 100-1,000 and ≥ 1,000 training object instances. We denote them as subset (0, 10), subset [10, 100), subset [100, 1000) and subset [1000, -] for convenience of expression. Please see Table 1 for detailed statistics.

Sets          (0, 10)  [10, 100)  [100, 1000)  [1000, -]  Total
Train         294      453        302          181        1230
Train-on-val  67       298        284          181        830

Table 1: Category division based on training instance number. Train-on-val means the subset of categories that appear in the validation set.

Beyond the test set results, we evaluate model performance based on such category split in this report, in
order to see the effect of the number of training instances and to analyze long-tail object instance detection models. We note that improvement on the tail bin, i.e. subset (0, 10), of the validation set does not contribute much to the overall AP, as this bin contains only 67 classes; the category distribution of the test set, however, is unknown.

Training and Evaluation. Our implementation is based on the mmdetection toolkit [4]. Unless otherwise stated, the models are trained on the LVIS-v0.5 training set and evaluated on the LVIS-v0.5 validation set for the mask prediction task. The external data used in the experiments are introduced in Sec. 4. All models are trained with SGD with momentum 0.9 and 8 images per minibatch. Unless otherwise stated, the learning rate is 0.01/0.001/0.0001 up to the 8th/11th/12th epoch respectively.

2. Classification Calibration

We first investigate the performance degradation of the baseline Mask-RCNN [6] on tail classes. Then, based on our observations of the possible causes of this phenomenon, we propose a classification calibration method to improve model performance on tail classes.
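The two setup details from Sec. 1, the category split by training instance count and the step learning-rate schedule, can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: the function names and the toy counts are hypothetical, and the schedule assumes one reading of the text (0.01 for epochs 1-8, 0.001 for epochs 9-11, 0.0001 for epoch 12).

```python
from collections import Counter

# Bin edges follow the split in Sec. 1: (0, 10), [10, 100), [100, 1000), [1000, -].
BIN_EDGES = [10, 100, 1000]
BIN_NAMES = ["(0, 10)", "[10, 100)", "[100, 1000)", "[1000, -]"]


def bin_name(num_instances):
    """Return the subset name for a category with `num_instances` training instances."""
    if num_instances <= 0:
        raise ValueError("a category needs at least one training instance")
    for edge, name in zip(BIN_EDGES, BIN_NAMES):
        if num_instances < edge:
            return name
    return BIN_NAMES[-1]


def count_categories_per_bin(instance_counts):
    """Count how many categories fall into each subset.

    `instance_counts` maps a category id to its number of training instances.
    It is a hypothetical plain dict, not an object from the LVIS API.
    """
    counts = Counter(bin_name(n) for n in instance_counts.values())
    return {name: counts.get(name, 0) for name in BIN_NAMES}


def learning_rate(epoch, base_lr=0.01, decay_epochs=(8, 11), gamma=0.1):
    """Step schedule with 1-indexed epochs.

    Assumption: the rate is multiplied by `gamma` once an epoch listed in
    `decay_epochs` has finished, so epoch 9 is the first to use 0.001.
    """
    num_decays = sum(epoch > e for e in decay_epochs)
    return base_lr * gamma ** num_decays


# Toy counts chosen to hit every bin boundary (not real LVIS statistics).
toy_counts = {1: 3, 2: 10, 3: 99, 4: 100, 5: 1000, 6: 25000}
print(count_categories_per_bin(toy_counts))
# {'(0, 10)': 1, '[10, 100)': 2, '[100, 1000)': 1, '[1000, -]': 2}
```

Applied to the real per-category instance counts of the LVIS v0.5 training set, `count_categories_per_bin` would be expected to reproduce the Train row of Table 1.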

[1] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[2] Kaiming He, et al. Mask R-CNN, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3] Vittorio Ferrari, et al. COCO-Stuff: Thing and Stuff Classes in Context, 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Nuno Vasconcelos, et al. Cascade R-CNN: Delving Into High Quality Object Detection, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Kai Chen, et al. Hybrid Task Cascade for Instance Segmentation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Kai Chen, et al. MMDetection: Open MMLab Detection Toolbox and Benchmark, 2019, ArXiv.

[7] Ross B. Girshick, et al. LVIS: A Dataset for Large Vocabulary Instance Segmentation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Saining Xie, et al. Decoupling Representation and Classifier for Long-Tailed Recognition, 2019, ICLR.