论文信息 - Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction

Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction

We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent unit (GRU) taking a question as its input and a fully-connected layer generating a set of candidate weights as its output. However, it is challenging to construct a parameter prediction network for a large number of parameters in the fully-connected dynamic parameter layer of the CNN. We reduce the complexity of this problem by incorporating a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine individual weights in the dynamic parameter layer. The proposed network-joint network with the CNN for ImageQA and the parameter prediction network-is trained end-to-end through back-propagation, where its weights are initialized using a pre-trained CNN and GRU. The proposed algorithm illustrates the state-of-the-art performance on all available public ImageQA benchmarks.

[1] Ming Yang,et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Lukás Burget,et al. Recurrent neural network based language model , 2010, INTERSPEECH.

[4] Trevor Darrell,et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[5] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[6] Yixin Chen,et al. Compressing Neural Networks with the Hashing Trick , 2015, ICML.

[7] Richard S. Zemel,et al. Exploring Models and Data for Image Question Answering , 2015, NIPS.

[8] Bolei Zhou,et al. Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[9] Derek Hoiem,et al. Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[10] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[11] Sanja Fidler,et al. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12] Ivan Laptev,et al. Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13] Mario Fritz,et al. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14] Razvan Pascanu,et al. On the difficulty of training recurrent neural networks , 2012, ICML.

[15] Sanja Fidler,et al. Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16] Martha Palmer,et al. Verb Semantics and Lexical Selection , 1994, ACL.

[17] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18] Sanja Fidler,et al. Skip-Thought Vectors , 2015, NIPS.

[19] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[20] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[21] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[22] Lin Ma,et al. Learning to Answer Questions from Image Using Convolutional Neural Network , 2015, AAAI.

[23] Iasonas Kokkinos,et al. Describing Textures in the Wild , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24] Misha Denil,et al. Predicting Parameters in Deep Learning , 2014 .

[25] Mario Fritz,et al. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[26] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[27] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[28] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[29] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[30] Wei Xu,et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question , 2015, NIPS.