A Study of Automatic Metrics for the Evaluation of Natural Language Explanations

As transparency becomes key for robotics and AI, it will be necessary to evaluate the methods through which transparency is provided, including automatically generated natural language (NL) explanations. Here, we explore parallels between the generation of such explanations and the much-studied field of evaluation of Natural Language Generation (NLG). Specifically, we investigate which NLG evaluation measures map well to explanations. We present the ExBAN corpus: a crowd-sourced corpus of NL explanations for Bayesian Networks. We compute correlations between human subjective ratings and automatic NLG metrics, and find that embedding-based metrics, such as BERTScore and BLEURT, correlate more strongly with human ratings than word-overlap metrics, such as BLEU and ROUGE. This work has implications for Explainable AI and transparent robotic and autonomous systems.
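The core analysis described above can be sketched as a rank correlation between per-explanation metric scores and human ratings. A minimal illustration in pure Python follows; the scores and ratings are made up for demonstration and are not data from the ExBAN corpus, and the implementation assumes no tied values for simplicity (a full analysis would use a library routine that handles ties, e.g. `scipy.stats.spearmanr`):

```python
# Spearman rank correlation between automatic metric scores and human ratings,
# implemented in pure Python. Assumes no tied values.

def rank(values):
    # Assign ranks 1..n by ascending sorted order (no ties assumed).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for five generated explanations (illustrative only).
metric_scores = [0.31, 0.55, 0.42, 0.78, 0.66]   # e.g. a BERTScore-like metric
human_ratings = [2.0, 3.5, 3.0, 4.0, 4.5]        # e.g. mean Likert clarity

print(round(spearman(metric_scores, human_ratings), 3))  # → 0.9
```

Comparing such correlation coefficients across metrics (BLEU, ROUGE, BERTScore, BLEURT) against the same human ratings is what distinguishes the word-overlap from the embedding-based families.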
