Heuristics2Annotate: Efficient Annotation of Large-Scale Marathon Dataset For Bounding Box Regression

Annotating a large-scale in-the-wild person re-identification dataset, especially one of marathon runners, is a challenging task. Variations in camera viewpoint, resolution, occlusion, and illumination make the problem non-trivial. Manually annotating bounding boxes in such large-scale datasets is cost-inefficient, and crowding and occlusion in the videos make aligning runner identities across multiple disjoint cameras difficult. We collected a novel large-scale in-the-wild video dataset of marathon runners: hours of footage of thousands of runners captured by 42 hand-held smartphone cameras under real-world conditions. We propose a new scheme that tackles the challenges of annotating such a large dataset and reduces the overall annotation cost in both time and budget. We perform a frame-rate (FPS) analysis to reduce annotation effort and time, and we investigate several annotation methods for efficiently generating tight bounding boxes. Our results show that interpolating bounding boxes between keyframes is the most efficient of the evaluated methods, 3x faster than the naive baseline. We also introduce a novel way of aligning runner identities across disjoint cameras: our inter-camera alignment tool, integrated with a state-of-the-art person re-identification system, proves effective for aligning runners across multiple cameras with non-overlapping views. Overall, our proposed annotation framework reduces the annotation cost of the dataset by a factor of 16x while correctly aligning 93.64% of the runners in the cross-camera setting.
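The keyframe-interpolation strategy described above can be sketched as follows. This is a minimal illustration, assuming axis-aligned boxes in (x, y, w, h) format and simple linear interpolation between two manually annotated keyframes; the function and box format are illustrative, not the paper's actual tooling.

```python
def interpolate_boxes(kf_a, box_a, kf_b, box_b):
    """Linearly interpolate boxes for the frames strictly between
    two annotated keyframes kf_a < kf_b.

    box_a, box_b: (x, y, w, h) tuples for the keyframes.
    Returns {frame_index: (x, y, w, h)} for intermediate frames.
    """
    boxes = {}
    for f in range(kf_a + 1, kf_b):
        t = (f - kf_a) / (kf_b - kf_a)  # fraction of the way to keyframe b
        boxes[f] = tuple((1 - t) * a + t * b for a, b in zip(box_a, box_b))
    return boxes

# The annotator labels only frames 0 and 4; frames 1-3 are filled in
# automatically, so only a fraction of frames need manual boxes.
interp = interpolate_boxes(0, (100, 50, 40, 80), 4, (140, 50, 40, 80))
```

Because a runner's motion between nearby keyframes is approximately linear, this amortizes one manual box over several frames, which is where the reported speed-up over per-frame annotation comes from.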
Index Terms — dataset, annotation, computer vision, cameras, benchmarks, object detection, object tracking, interpolation, marathon, bib detection, cross-camera alignment, person re-identification

Figure 1: Sample images from videos recorded by multiple disjoint cameras. Different scenarios, including variation in light intensity, occlusion, shadows, and differing camera poses and angles, make the task of runner detection non-trivial.
