Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts

Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the `recognition' and `localization' of an unseen category. To address this limitation, we introduce a new \emph{`Zero-Shot Detection'} (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from recognition to detection setting. Our extensive experiments show significant performance boost over the baseline on the imperative yet difficult ZSD problem.

[1]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[2]  Xiaoyong Du,et al.  Zero-shot Image Tagging by Hierarchical Semantic Embedding , 2015, SIGIR.

[3]  Qiang Ji,et al.  A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects , 2013, 2013 IEEE International Conference on Computer Vision.

[4]  Rainer Stiefelhagen,et al.  Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Yanan Li,et al.  Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[8]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Philip H. S. Torr,et al.  Straight to Shapes: Real-Time Detection of Encoded Shapes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[11]  Sanja Fidler,et al.  Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Wei-Lun Chao,et al.  Synthesized Classifiers for Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Yuhong Guo,et al.  Zero-Shot Classification with Discriminative Semantic Representation Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Bernt Schiele,et al.  Latent Embeddings for Zero-Shot Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Arnold W. M. Smeulders,et al.  Tracking by Natural Language Specification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Bernt Schiele,et al.  Evaluation of output embeddings for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[21]  Shih-Fu Chang,et al.  Designing Category-Level Attributes for Discriminative Visual Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[23]  Fatih Porikli,et al.  A Unified Approach for Conventional Zero-Shot, Generalized Zero-Shot, and Few-Shot Learning , 2017, IEEE Transactions on Image Processing.

[24]  Tao Xiang,et al.  Zero-Shot Learning on Semantic Class Prototype Graph , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Frédéric Jurie,et al.  Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classiffication , 2016, ECCV.

[26]  Yuri Owechko,et al.  Zero Shot Learning via Multi-scale Manifold Regularization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Nazli Ikizler-Cinbis,et al.  Attributes2Classname: A Discriminative Model for Attribute-Based Unsupervised Zero-Shot Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[30]  Andrew Y. Ng,et al.  Zero-Shot Learning Through Cross-Modal Transfer , 2013, NIPS.

[31]  Venkatesh Saligrama,et al.  Zero-Shot Learning via Semantic Similarity Embedding , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[33]  Kristen Grauman,et al.  Zero-shot recognition with unreliable attributes , 2014, NIPS.

[34]  Philip H. S. Torr,et al.  An embarrassingly simple approach to zero-shot learning , 2015, ICML.

[35]  Tao Xiang,et al.  Learning a Deep Embedding Model for Zero-Shot Learning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yuji Matsumoto,et al.  Ridge Regression, Hubness, and Zero-Shot Learning , 2015, ECML/PKDD.

[37]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[38]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[39]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Yang Yang,et al.  Matrix Tri-Factorization with Manifold Regularizations for Zero-Shot Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[42]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[44]  Cordelia Schmid,et al.  Label-Embedding for Image Classification , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Venkatesh Saligrama,et al.  Zero-Shot Learning via Joint Latent Similarity Embedding , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Bernt Schiele,et al.  Zero-Shot Learning — The Good, the Bad and the Ugly , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Tao Xiang,et al.  Transductive Multi-label Zero-shot Learning , 2014, BMVC.

[48]  Cees Snoek,et al.  Attributes Make Sense on Segmented Objects , 2014, ECCV.

[49]  Bernt Schiele,et al.  Multi-cue Zero-Shot Learning with Strong Supervision , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Shaogang Gong,et al.  Unsupervised Domain Adaptation for Zero-Shot Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  Shaogang Gong,et al.  Semantic Autoencoder for Zero-Shot Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Samy Bengio,et al.  Zero-Shot Learning by Convex Combination of Semantic Embeddings , 2013, ICLR.

[53]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[54]  Nuno Vasconcelos,et al.  Semantically Consistent Regularization for Zero-Shot Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).