A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus of NL explanations for Bayesian Networks. We compute correlations between human subjective ratings and automatic NLG metrics, and find that embedding-based metrics, such as BERTScore and BLEURT, correlate more strongly with human ratings than word-overlap metrics, such as BLEU and ROUGE. This work has implications for Explainable AI and transparent robotic and autonomous systems.
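The core analysis described above can be sketched as a rank correlation between per-explanation metric scores and human ratings. A minimal illustration in pure Python follows; the scores and ratings are made up for demonstration and are not data from the ExBAN corpus, and the implementation assumes no tied values for simplicity (a full analysis would use a library routine that handles ties, e.g. `scipy.stats.spearmanr`):

```python
# Spearman rank correlation between automatic metric scores and human ratings,
# implemented in pure Python. Assumes no tied values.

def rank(values):
    # Assign ranks 1..n by ascending sorted order (no ties assumed).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for five generated explanations (illustrative only).
metric_scores = [0.31, 0.55, 0.42, 0.78, 0.66]   # e.g. a BERTScore-like metric
human_ratings = [2.0, 3.5, 3.0, 4.0, 4.5]        # e.g. mean Likert clarity

print(round(spearman(metric_scores, human_ratings), 3))  # → 0.9
```

Comparing such correlation coefficients across metrics (BLEU, ROUGE, BERTScore, BLEURT) against the same human ratings is what distinguishes the word-overlap from the embedding-based families.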
