Why Does a Visual Question Have Different Answers?

Visual question answering is the task of returning the answer to a question about an image. A challenge is that different people often provide different answers to the same visual question. To our knowledge, this is the first work that aims to understand why. We propose a taxonomy of nine plausible reasons, and create two labelled datasets, consisting of ~45,000 visual questions, that indicate which reasons led to answer differences. We then propose a novel problem of predicting, directly from a visual question, which reasons will cause answer differences, as well as a novel algorithm for this purpose. Experiments demonstrate the advantage of our approach over several related baselines on two diverse datasets. We publicly share the datasets and code at https://vizwiz.org.
