The Wisdom of MaSSeS: Majority, Subjectivity, and Semantic Similarity in the Evaluation of VQA

We introduce MASSES, a simple evaluation metric for the task of Visual Question Answering (VQA). In its standard form, the VQA task is operationalized as follows: given an image and an open-ended question in natural language, systems are required to provide a suitable answer. Currently, model performance is evaluated by means of a somewhat simplistic metric: if the predicted answer is chosen by at least 3 human annotators out of 10, it is considered 100% correct. Though intuitively valuable, this metric has some important limitations. First, it ignores whether the predicted answer is the one selected by the Majority (MA) of annotators. Second, it does not account for the quantitative Subjectivity (S) of the answers in the sample (and dataset). Third, information about the Semantic Similarity (SES) of the responses is completely neglected. To address these limitations, we propose a multi-component metric that accounts for all these issues. We show that our metric provides a more fine-grained evaluation at both the quantitative and qualitative levels.
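For concreteness, below is a minimal Python sketch of the standard VQA accuracy metric described above (an answer is fully correct when at least 3 of the 10 annotators gave it, and partially credited otherwise). The function name, string normalization, and example answers are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the standard VQA accuracy metric: min(#matching annotators / 3, 1).
# Names and normalization choices are illustrative, not from the MaSSeS paper.

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Return 1.0 if at least 3 of the annotators gave the predicted answer,
    otherwise a partial score proportional to the number of matches."""
    pred = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Hypothetical example: 4 of 10 annotators answered "red".
answers = ["red", "red", "red", "red", "blue", "crimson", "red car",
           "dark red", "maroon", "reddish"]
print(vqa_accuracy("red", answers))   # 1.0  (chosen by >= 3 annotators)
print(vqa_accuracy("blue", answers))  # 0.33 (chosen by only 1 annotator)
```

Note how the standard metric gives "red" full credit while treating "crimson" or "dark red" as entirely wrong; the MA, S, and SES components proposed in the paper are meant to capture exactly this kind of majority, subjectivity, and semantic-similarity information.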
