Automatic evaluation of end-to-end dialog systems with adequacy-fluency metrics

Abstract: End-to-end dialog systems are gaining interest due to recent advances in deep neural networks and the availability of large human–human dialog corpora. However, despite being fundamental to systematically improving such systems, automatic evaluation of the generated dialog utterances remains an unsolved problem. Indeed, most of the proposed objective metrics have shown low correlation with human evaluations. In this paper, we evaluate a two-dimensional metric designed to operate at the sentence level, which considers the syntactic and semantic information carried by the answers an end-to-end dialog system generates with respect to a set of references. When applied to the outputs of the systems participating in track 2 of the DSTC-6 challenge, the proposed metric shows a higher correlation with human evaluations (up to 12.8% relative improvement at the system level) than the best of the alternative state-of-the-art automatic metrics currently available.
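To make the two-dimensional design concrete, the minimal Python sketch below assembles a toy adequacy-fluency score: adequacy as the best cosine similarity between the hypothesis embedding and a set of reference embeddings (the semantic dimension), fluency as a length-normalized language-model log-probability mapped to [0, 1] (the syntactic dimension), and a convex combination of the two. The embedding source, the normalization floor, and the weight alpha are illustrative assumptions, not the paper's exact formulation.

# A minimal sketch of a two-dimensional adequacy-fluency score.
# The function names, the floor constant, and the weight alpha are
# illustrative assumptions, not the paper's exact formulation.
import numpy as np


def adequacy(hyp_vec: np.ndarray, ref_vecs: list) -> float:
    """Semantic dimension: best cosine similarity between the hypothesis
    embedding and each reference embedding."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0
    return max(cos(hyp_vec, r) for r in ref_vecs)


def fluency(log_prob: float, n_tokens: int, floor: float = -12.0) -> float:
    """Syntactic dimension: length-normalized LM log-probability mapped to
    [0, 1]; `floor` is an assumed worst-case per-token log-probability."""
    per_token = log_prob / max(n_tokens, 1)
    # per_token and floor are both negative, so the ratio lies in [0, inf);
    # clip it to [0, 1] and invert so that higher probability means higher fluency.
    return 1.0 - min(max(per_token / floor, 0.0), 1.0)


def am_fm(adeq: float, flu: float, alpha: float = 0.5) -> float:
    """Combine the two dimensions; a convex combination with an assumed
    weight `alpha` (a harmonic mean is another common choice)."""
    return alpha * adeq + (1.0 - alpha) * flu

At the system level, such a score would then be averaged over each system's generated responses and correlated (e.g., via Pearson's r) with the corresponding mean human ratings for each system.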
