Using Robust Networks to Inform Lightweight Models in Semi-Supervised Learning for Object Detection

Object detection algorithms commonly trade accuracy for speed, or vice versa. To meet our application’s real-time requirement, we use a Single Shot MultiBox Detector (SSD) model. This architecture meets our latency requirements; however, it needs a large amount of training data to reach an acceptable level of accuracy. More robust network architectures, such as Regions with CNN features (R-CNN), are unusable for our end application but offer an important advantage over SSD models: they can be trained more reliably on small datasets. We fine-tune R-CNN models on a small number of hand-labeled examples and then run inference on the remaining unlabeled data to create new, larger training datasets. We show that these inferred labels benefit the training of lightweight models. The inferred datasets are imperfect, and we explore several ways of handling their errors: hand-correcting mislabeled data, discarding poor examples, and simply ignoring the errors. Finally, we compare the total cost of this workflow, measured in human and computer time, against a hand-labeling baseline.
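
The labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Detection`, `build_pseudo_labels`, `teacher_predict`, and `min_score` are hypothetical, and using the teacher's confidence score to decide which examples are "poor" is an assumption made only for illustration. The sketch covers two of the three error-handling strategies, ignoring errors and discarding poor examples; hand-correction is a manual step and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Detection:
    """One box predicted by the teacher (e.g. a fine-tuned R-CNN model)."""
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    score: float  # teacher confidence in [0, 1]


def build_pseudo_labels(
    unlabeled_ids: Iterable[str],
    teacher_predict: Callable[[str], List[Detection]],
    policy: str = "discard",
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, List[Detection]]:
    """Label unlabeled images with teacher predictions.

    policy="ignore":  keep every prediction as-is, errors included.
    policy="discard": keep an image only if it has at least one detection
                      and every detection scores >= min_score.
    """
    if policy not in ("ignore", "discard"):
        raise ValueError(f"unknown policy: {policy!r}")

    dataset: Dict[str, List[Detection]] = {}
    for image_id in unlabeled_ids:
        detections = teacher_predict(image_id)
        if policy == "ignore":
            dataset[image_id] = detections
        elif detections and all(d.score >= min_score for d in detections):
            dataset[image_id] = detections
    return dataset
```

The resulting dictionary would then be merged with the hand-labeled examples to train the lightweight SSD model. Dropping a whole image when any detection falls below the threshold is one possible design choice; filtering individual boxes instead would leave unlabeled objects in the image, which an SSD model would treat as background.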
