Pixel-Wise Object Segmentations for the VOT 2016 Dataset

This technical report describes the methodology used to acquire pixel-wise annotations of the objects of interest in the 60 video sequences of the VOT2016 dataset (http://www.votchallenge.net/vot2016/dataset.html), and the automatic estimation, from these object segmentations, of the rotated bounding boxes that were used in the VOT2016 challenge evaluation. The report is published to accompany the segmentation data and must be cited whenever the segmentations are used. The code for the estimation of the bounding boxes is publicly available.

1 Acquisition of Pixel-Wise Segmentations

The acquisition of the segmentations was outsourced to Eyedea (http://www.eyedea.cz/), a company experienced in data annotation. The annotation was performed by multiple people using an interactive semi-automatic annotation tool written by T. Vojir (https://github.com/vojirt/grabcut_annotation_tool). In each frame, the tool runs GrabCut [3] object segmentation, initialized using the VOT2015 ground truth bounding box and the segmentation mask propagated from the previous frame using optical flow. The user can then interactively mark additional foreground or background pixels to improve the segmentation. Examples of the object segmentations are shown in Fig. 1.

Figure 1: Examples of object segmentations.

To assess the segmentation quality, a visual inspection step was incorporated into the annotation process. In this step, a quality inspection view is generated that visualises a "sure" background mask (the inverse of a dilation of the segmentation mask) and a "sure" foreground mask (an erosion of the segmentation mask). The only guideline given to annotators on annotation quality was formulated as follows: the "sure" background mask must not contain any object pixels and the "sure" foreground mask must not contain any background pixels. This formulation allows for a small degree of error on the object boundary. The quality inspection view is shown in Fig. 2.
Since the resolutions of the video sequences of the VOT2016 dataset [2] are roughly the same, the kernel size for the morphological operations was fixed at (9, 9) pixels.

Figure 2: The visual test for inspecting segmentation quality. Left: the ground truth bounding box used for the segmentation initialization and the currently segmented pixels (highlighted, and also masked in the right image). Middle: the quality inspection view, i.e. the background mask (red), which has to be fully outside the object, and the foreground mask (green), which has to be inside the object.

Differences in segmentation quality are caused mainly by how careful the individual annotators were. Also, some objects do not have clearly defined boundaries, e.g. it is not clear where a hand ends and the forearm starts.
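
The quality inspection masks described in this section can be illustrated with a minimal pure-Python sketch. It assumes a binary mask stored as a list of rows of 0/1 values, a square structuring element of the (9, 9) size given in the text, and that pixels outside the image are ignored at the borders. All function names are hypothetical, and the `reference` mask is only a stand-in for the annotator's visual judgement; the annotation tool itself may compute the masks differently, for instance with an image-processing library.

```python
def _morph(mask, k, reduce_fn):
    """Apply a k x k square-kernel morphological operation to a 0/1 mask.

    Assumption: pixels outside the image are ignored, so the window is
    simply clipped at the image borders.
    """
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = (
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            row.append(1 if reduce_fn(window) else 0)
        out.append(row)
    return out


def dilate(mask, k=9):
    """A pixel is set if any pixel in its k x k neighbourhood is set."""
    return _morph(mask, k, any)


def erode(mask, k=9):
    """A pixel is set only if every pixel in its k x k neighbourhood is set."""
    return _morph(mask, k, all)


def inspection_masks(seg, k=9):
    """Return (sure_bg, sure_fg) for a segmentation mask.

    sure_bg is the inverse of the dilated segmentation,
    sure_fg is the eroded segmentation.
    """
    sure_bg = [[1 - v for v in row] for row in dilate(seg, k)]
    sure_fg = erode(seg, k)
    return sure_bg, sure_fg


def passes_guideline(seg, reference, k=9):
    """Check the annotation guideline against a hypothetical reference mask.

    The check fails if the sure background contains an object pixel or
    the sure foreground contains a background pixel.
    """
    sure_bg, sure_fg = inspection_masks(seg, k)
    for y, ref_row in enumerate(reference):
        for x, is_object in enumerate(ref_row):
            if sure_bg[y][x] and is_object:
                return False
            if sure_fg[y][x] and not is_object:
                return False
    return True


def square_mask(h, w, top, left, size):
    """Helper for the example: an h x w mask with one filled square."""
    return [
        [1 if top <= y < top + size and left <= x < left + size else 0
         for x in range(w)]
        for y in range(h)
    ]


# Example: a 10 x 10 segmentation compared with a reference object
# that is offset by 2 pixels in each direction.
seg = square_mask(30, 30, 10, 10, 10)
reference = square_mask(30, 30, 12, 12, 10)
ok = passes_guideline(seg, reference, k=9)
```

Under these assumptions, a 9 x 9 kernel leaves a band of 4 pixels on either side of the segmentation boundary that belongs to neither mask, which is where the guideline tolerates annotation error. A larger kernel would widen this band and make the check more lenient.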