A novel feature-based model for zero-shot object detection with simulated attributes

Zero-shot object detection (ZSD) has recently been proposed for detecting objects whose categories were never seen during training. Existing ZSD methods have drawbacks: (a) end-to-end methods sacrifice mean average precision (mAP) on seen classes; (b) feature-based methods avoid this problem but rely on overly simple feature construction. In this paper, we present a succinct but effective feature-based ZSD model whose feature construction leverages the deep feature embedding of the detector itself as the visual features of the detected objects. These features, named “Detection Feature” (DetFeat), contain not only visual representations but also context and position information, which provides more discriminative information for seen and unseen objects. Additionally, we simulate the construction of attributes defined by human experts to generate a label embedding specific to the ZSD task, named “Simulated Attributes” (Simu-Attr). We find that Simu-Attr promotes better alignment between the visual and semantic spaces, alleviating the semantic gap. Extensive experiments show that our approach improves detection performance on unseen classes while maintaining high detection performance on seen classes. On the challenging COCO dataset, we surpass TL-ZSD, the best existing transductive ZSD method, by about 1% on unseen classes and about 10% on seen classes, using mAP as the metric.
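
The general idea behind feature-based zero-shot classification, in which a detected object's feature vector is mapped into a semantic space and matched against per-class label embeddings, can be illustrated with a minimal sketch. This is not the model described above: the function names, the linear projection `W`, and the toy two-dimensional class embeddings are all hypothetical, chosen only to show how an unseen class can be predicted from its embedding alone. In practice the projection would be learned on seen classes, and the label embeddings would come from word vectors or attributes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def project(W, x):
    """Apply a linear map W (list of rows) to the feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def classify(det_feat, W, class_embeddings):
    """Project a detection feature into the semantic space and return the
    class whose label embedding is most similar, plus all similarity scores."""
    z = project(W, det_feat)
    scores = {name: cosine(z, emb) for name, emb in class_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy setup: 3-d detection features, 2-d semantic space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]  # assumed to have been learned on seen classes
class_embeddings = {
    "zebra": [1.0, 0.0],  # stands in for an unseen class
    "truck": [0.0, 1.0],  # stands in for a seen class
}

label, scores = classify([0.9, 0.1, 0.2], W, class_embeddings)
print(label)
```

Because the classifier only compares against label embeddings, adding an unseen class requires a new entry in `class_embeddings` rather than new training images.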
