Dual Learning for Visual Question Generation

Recently, automatically answering questions about images (visual question answering, VQA) has attracted considerable attention in the computer vision community. In contrast, there is little work on automatically generating questions for images (visual question generation, VQG). Yet VQG closes the loop with question answering, and producing diverse questions is itself useful to VQA research. Motivated by the assumption that learning to answer questions can boost question generation, in this paper we introduce the VQA task as the complement of our primary VQG task and propose a novel model that jointly learns the two dual tasks in a dual learning framework. In the framework, we devise one agent for VQG and one for VQA, each initialized with a pre-trained model; their learning tasks form a closed loop whose objectives are optimized together, so that each task guides the other through a reinforcement learning process. Task-specific rewards are designed to update the agents' models with the policy gradient method, and the relation between the two tasks is exploited to further improve the primary VQG task. Extensive experiments on two large-scale datasets show that the proposed method generates grounded visual questions with good coverage and outperforms previous VQG methods on standard measures.
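To make the closed-loop training concrete, the following is a minimal sketch of one dual-learning update in PyTorch, not the authors' implementation. The agent objects and their sample()/score methods are hypothetical placeholders standing in for the pre-trained VQG and VQA models, and the particular rewards shown are illustrative choices under the assumption that each agent's reward comes from the other agent, as the abstract describes.

import torch

def dual_learning_step(vqg_agent, vqa_agent, image, optimizer):
    # Primal direction: the VQG agent samples a question for the image and
    # returns the log-probability of that sample (needed by REINFORCE).
    question, logp_q = vqg_agent.sample(image)  # hypothetical API

    # Dual direction: the VQA agent answers the generated question.
    answer, logp_a = vqa_agent.sample(image, question)  # hypothetical API

    # Task-specific rewards close the loop. As one plausible instantiation,
    # the VQG agent is rewarded when its question is answerable by the VQA
    # agent, and the VQA agent is rewarded when the question-answer pair is
    # likely under the VQG model. detach() keeps rewards out of the gradient,
    # so gradients flow only through the log-probabilities.
    reward_q = vqa_agent.answer_score(image, question, answer).detach()
    reward_a = vqg_agent.question_score(image, question, answer).detach()

    # Policy gradient: minimize the negative reward-weighted log-likelihood
    # of both agents' sampled actions, i.e. maximize expected reward.
    loss = -(reward_q * logp_q + reward_a * logp_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch a single optimizer updates both agents jointly; in practice the two models could also be updated with separate optimizers, and the reward definitions are the main design choice the framework leaves open.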
