Automatic Image Labeling with Click Supervision on Aerial Images

Manually annotating bounding boxes for object detection is time-consuming. Although human annotation is the most accurate approach, machine learning models can assist. In this paper, we propose a human-in-the-loop automatic image labeling framework focused on aerial images, which offer fewer features for detection. The proposed framework consists of two main parts: a prediction model and an adjustment model. The user first provides a click location to the prediction model, which generates a bounding box for the clicked object. The adjustment model then refines the size and location of that bounding box. A feedback-and-retrain mechanism lets users manually adjust the generated bounding box and feed the correction back to incrementally train the adjustment network at runtime. This online learning feature enables users to generalize an existing model to target classes not present in the initial training set, and it gradually improves the model's specificity to those new targets. We demonstrate promising results on the Neovision 2 Heli dataset. Compared to the state-of-the-art method, our prediction model achieves a higher detection rate, and our adjustment model improves the IoU by up to 45%.
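
The IoU (intersection over union) mentioned above measures how well a generated bounding box overlaps a reference box. The following is a minimal sketch of that metric, not the paper's evaluation code: the function name `iou` and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box format (not specified in the abstract): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield an empty intersection
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction share a 1x1 overlap, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7, while identical boxes give 1.0 and disjoint boxes give 0.0.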
