A Question Type Driven Framework to Diversify Visual Question Generation

Visual question generation aims at asking questions about an image automatically. Existing research works on this topic usually generate a single question for each given image without considering the issue of diversity. In this paper, we propose a question type driven framework to produce multiple questions for a given image with different focuses. In our framework, each question is constructed following the guidance of a sampled question type in a sequence-to-sequence fashion. To diversify the generated questions, a novel conditional variational auto-encoder is introduced to generate multiple questions with a specific question type. Moreover, we design a strategy to conduct the question type distribution learning for each image to select the final questions. Experimental results on three benchmark datasets show that our framework outperforms the state-of-the-art approaches in terms of both relevance and diversity.

[1]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[2]  Noah A. Smith,et al.  Proceedings of NIPS , 2010, NIPS 2010.

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[5]  Jianfeng Gao,et al.  Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation , 2017, IJCNLP.

[6]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[7]  Alexander G. Schwing,et al.  Creativity: Generating Diverse Questions Using Variational Autoencoders , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9]  Rémi Eyraud,et al.  Proceedings of CoNLL , 2006 .

[10]  Ashwin K. Vijayakumar,et al.  Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models , 2016, ArXiv.

[11]  Shaodi You,et al.  Automatic Generation of Grounded Visual Questions , 2016, IJCAI.

[12]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[13]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[14]  Richard S. Zemel,et al.  Exploring Models and Data for Image Question Answering , 2015, NIPS.

[15]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[16]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Piji Li,et al.  Deep Recurrent Generative Decoder for Abstractive Text Summarization , 2017, EMNLP.

[18]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[19]  Tsukasa Hirashima,et al.  Automated Question Generation Methods for Intelligent English Learning Systems and its Evaluation , 2001 .

[20]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[22]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[23]  Margaret Mitchell,et al.  Generating Natural Questions About an Image , 2016, ACL.

[24]  Noah A. Smith,et al.  Proceedings of EMNLP , 2007 .