A Metamorphic Testing Approach for Assessing Question Answering Systems

Question Answering (QA) enables machines to understand and answer questions posed in natural language, and it has emerged as a powerful tool in various domains. However, QA is a challenging task, and there is increasing concern about the quality of QA systems. In this paper, we propose applying metamorphic testing (MT) to evaluate QA systems from the users’ perspective, helping users better understand the capabilities of these systems and select the QA system appropriate for their specific needs. We study two typical categories of QA systems, textual QA (TQA) and visual QA (VQA), and identify a total of 17 metamorphic relations (MRs) for them. Each MR focuses on particular characteristics of a different aspect of QA. We then apply MT with all of the MRs to four QA systems: two APIs from the AllenNLP platform, one API from the Transformers platform, and one API from CloudCV. Our experimental results characterize the capabilities of the four subject QA systems from various aspects, revealing their strengths and weaknesses. These results further suggest that MT can be an effective method for assessing QA systems.
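
To make the idea of a metamorphic relation concrete, the following is a minimal sketch of how one MR check for a textual QA system might look. It is illustrative only: the relation shown (appending a sentence that carries no answer-bearing information should leave the answer unchanged) is an assumed example and not necessarily one of the 17 MRs in the paper, and `toy_qa`, `mr_added_sentence_holds`, and the sample texts are hypothetical names and data introduced here. The toy system stands in for a real QA API so that the sketch runs without external dependencies.

```python
from typing import Callable

# A QA system is modelled as a function (context, question) -> answer.
QASystem = Callable[[str, str], str]


def toy_qa(context: str, question: str) -> str:
    """Hypothetical stand-in for a real TQA system.

    Returns the context sentence that shares the most words with the question.
    """
    def words(text: str) -> set:
        return {w.strip(".,?!").lower() for w in text.split()}

    question_words = words(question)
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(words(s) & question_words))


def mr_added_sentence_holds(qa: QASystem, context: str,
                            question: str, extra_sentence: str) -> bool:
    """Check one assumed MR: adding a sentence with no answer-bearing
    information to the context should not change the answer.

    No ground-truth answer is needed; only the source and follow-up
    outputs are compared with each other.
    """
    source_answer = qa(context, question)
    followup_answer = qa(context + " " + extra_sentence, question)
    return source_answer == followup_answer


context = "Paris is the capital of France. Berlin is the capital of Germany."
question = "What is the capital of France?"

# An unrelated sentence: the toy system keeps its answer, so the MR holds.
holds = mr_added_sentence_holds(toy_qa, context, question,
                                "The weather was sunny.")

# A sentence that echoes the question: the toy system is distracted,
# so the MR is violated and a weakness is revealed.
violated = not mr_added_sentence_holds(
    toy_qa, context, question,
    "What is the capital of France is a common question.")
```

In this sketch the check never consults a reference answer, which reflects the appeal of MT for users who lack an oracle: a violation of the relation alone signals a weakness of the system under test.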
