F1 Is Not Enough! Models and Evaluation towards User-Centered Explainable Question Answering

Explainable question answering systems predict an answer together with an explanation that shows why the answer was selected. The goal is to enable users to assess the correctness of the system and to understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation, which might cause serious issues in the user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling, as well as two evaluation scores to quantify the coupling. We conduct experiments on the HotpotQA benchmark dataset and perform a user study. The user study shows that our models increase users' ability to judge the correctness of the system, and that scores such as F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.
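
The abstract does not specify how the proposed coupling scores are computed. As a purely illustrative sketch, the snippet below shows one hypothetical containment-based proxy for answer-explanation coupling: it checks whether the predicted answer string is supported by at least one predicted explanation sentence. The function names (coupling_score, dataset_coupling) and the scoring rule are assumptions for illustration only, not the scores proposed in the paper.

    # Hypothetical sketch of an answer-explanation coupling proxy.
    # NOTE: names and scoring rule are illustrative assumptions, not the
    # paper's actual evaluation scores.
    from typing import List


    def coupling_score(answer: str, explanation_sentences: List[str]) -> float:
        """Return 1.0 if the predicted answer string occurs in at least one
        predicted explanation sentence, else 0.0 (a crude containment proxy)."""
        answer_norm = answer.strip().lower()
        if not answer_norm:
            return 0.0
        return float(any(answer_norm in s.lower() for s in explanation_sentences))


    def dataset_coupling(predictions: List[dict]) -> float:
        """Average the per-example coupling proxy over a list of predictions,
        where each prediction is {"answer": str, "explanation": List[str]}."""
        if not predictions:
            return 0.0
        scores = [coupling_score(p["answer"], p["explanation"]) for p in predictions]
        return sum(scores) / len(scores)


    if __name__ == "__main__":
        preds = [
            {"answer": "Barack Obama",
             "explanation": ["Barack Obama served as the 44th U.S. president."]},
            {"answer": "1969",
             "explanation": ["The Apollo program was run by NASA."]},  # answer unsupported
        ]
        print(dataset_coupling(preds))  # 0.5

Such a containment check only captures whether the answer is literally present in the explanation; a metric aligned with user experience, as the abstract argues, would additionally need to reflect whether the explanation lets users judge the answer's correctness.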
