Zero-Shot Object Detection with Textual Descriptions Using Convolutional Neural Networks

Zero-shot object detection aims to detect and recognize objects in images whose classes were not observed during training. Previous studies generally used concept names or textual descriptions to build relationships between seen and unseen classes, but they rarely exploited the information in textual descriptions to optimize the network. Textual descriptions carry rich category-related information, and exploiting it can help train the network and improve detection performance. Moreover, textual descriptions usually contain the names of the objects to be detected. Using this property, we can narrow the set of candidate unseen categories and thereby improve detection accuracy. We therefore propose a novel framework that incorporates both images and their textual descriptions for zero-shot object detection. In particular, we employ a text convolutional neural network (CNN) and Faster R-CNN to extract text features and image features, respectively, and combine them to refine the regions that contain objects and to classify the newly detected objects simultaneously. In addition, we extract potential object labels directly from textual descriptions and introduce online hard example mining (OHEM) to assist with object classification and network optimization. Extensive experiments on two public datasets demonstrate that our approach outperforms state-of-the-art methods.
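The two auxiliary ideas mentioned above, narrowing the candidate unseen categories using names found in the description and selecting hard examples for training, can be illustrated with a minimal sketch. This is not the implementation used in this work: the function names (`candidate_classes`, `classify_region`, `hardest_examples`), the whole-word matching rule, and the top-k selection rule are all assumptions made for illustration, and the per-class scores stand in for whatever the detector would produce.

```python
import re


def candidate_classes(description, unseen_classes):
    """Return the unseen class names mentioned in the description.

    Assumption: a class counts as mentioned if its name (optionally with a
    trailing "s") appears as a whole word, case-insensitively. If no class
    is mentioned, fall back to the full list of unseen classes.
    """
    text = description.lower()
    found = [
        c for c in unseen_classes
        if re.search(r"\b" + re.escape(c.lower()) + r"s?\b", text)
    ]
    return found or list(unseen_classes)


def classify_region(scores, description):
    """Pick the best-scoring class among the candidates named in the text.

    `scores` is a hypothetical mapping from unseen class name to a score
    for one detected region.
    """
    candidates = candidate_classes(description, list(scores))
    return max(candidates, key=lambda c: scores[c])


def hardest_examples(losses, k):
    """Return the indices of the k highest-loss examples.

    This mirrors the core selection step of online hard example mining in a
    simplified form: only the hardest examples contribute to the update.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    region_scores = {"zebra": 0.40, "horse": 0.45, "giraffe": 0.15}
    print(classify_region(region_scores, "Two zebras graze near a fence."))
    print(hardest_examples([0.1, 2.0, 0.5, 1.2], k=2))
```

In this toy example the description mentions only zebras, so the region is assigned to "zebra" even though "horse" has a slightly higher raw score. When the description names no known class, the sketch falls back to all unseen classes.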

