论文信息 - Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We we also establish an evaluation protocol for natural-language visual detection. Code is available at: http://ytzhang.net/projects/dbnet.

[1] Jitendra Malik,et al. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[3] Dan Klein,et al. Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Mario Fritz,et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[6] Bohyung Han,et al. Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9] Wei Xu,et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[10] Trevor Darrell,et al. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Jian Sun,et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Sanja Fidler,et al. Skip-Thought Vectors , 2015, NIPS.

[13] Phil Blunsom,et al. A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[14] Xiang Zhang,et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[15] Geoffrey E. Hinton,et al. Generating Text with Recurrent Neural Networks , 2011, ICML.

[16] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Bernt Schiele,et al. Generative Adversarial Text to Image Synthesis , 2016, ICML.

[18] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[19] Koen E. A. van de Sande,et al. Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[20] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Xinlei Chen,et al. Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Yuting Zhang,et al. Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[27] Trevor Darrell,et al. Segmentation from Natural Language Expressions , 2016, ECCV.

[28] Alex Graves,et al. Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[29] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Lukás Burget,et al. Recurrent neural network based language model , 2010, INTERSPEECH.

[31] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[32] Yi Li,et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[33] Xiaogang Wang,et al. DeepID-Net: Deformable deep convolutional neural networks for object detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[35] Andrew L. Maas. Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[36] Xiang Zhang,et al. Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[37] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[39] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[40] Andreas Geiger,et al. Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[42] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[44] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45] Licheng Yu,et al. Modeling Context in Referring Expressions , 2016, ECCV.

[46] Rob Fergus,et al. Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[47] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[48] Bernt Schiele,et al. Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[50] David A. McAllester,et al. Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51] Larry S. Davis,et al. Modeling Context Between Objects for Referring Expression Understanding , 2016, ECCV.

[52] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[53] Yoon Kim,et al. Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[54] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[55] C. Lawrence Zitnick,et al. Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[56] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Hao Zhou,et al. Visual Relationship Detection with Relative Location Mining , 2019, ACM Multimedia.

[58] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[60] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.