Paper Information - Why Does a Visual Question Have Different Answers?

Visual question answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. To our knowledge, this is the first work that aims to understand why. We propose a taxonomy of nine plausible reasons and create two labelled datasets, consisting of ~45,000 visual questions, that indicate which reasons led to answer differences. We then propose a novel problem of predicting, directly from a visual question, which reasons will cause answer differences, as well as a novel algorithm for this purpose. Experiments demonstrate the advantage of our approach over several related baselines on two diverse datasets. We publicly share the datasets and code at https://vizwiz.org.
