How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Joelle Pineau | Iulian Serban | Ryan Lowe | Chia-Wei Liu | Laurent Charlin | Michael Noseworthy
[1] Sanja Fidler,et al. Skip-Thought Vectors , 2015, NIPS.
[2] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[3] Joelle Pineau,et al. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues , 2016, AAAI.
[4] Anja Belz,et al. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems , 2009, CL.
[5] Marilyn A. Walker,et al. PARADISE: A Framework for Evaluating Spoken Dialogue Agents , 1997, ACL.
[6] Jianfeng Gao,et al. A Persona-Based Neural Conversation Model , 2016, ACL.
[7] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[8] David Vandyke,et al. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems , 2015, EMNLP.
[9] Jianfeng Gao,et al. A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.
[10] Anton Leuski,et al. Semi-formal Evaluation of Conversational Characters , 2009, Languages: From Formal to Natural.
[11] Jianfeng Gao,et al. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets , 2015, ACL.
[12] Stephanie Seneff,et al. Spoken Dialogue Systems , 2008.
[13] Tomoki Toda,et al. Utilizing Human-to-Human Conversation Examples for a Multi Domain Chat-Oriented Dialog System , 2014, IEICE Trans. Inf. Syst.
[14] Timothy Baldwin,et al. Accurate Evaluation of Segment-level Machine Translation Metrics , 2015, NAACL.
[15] Sebastian Möller,et al. Memo: towards automatic usability evaluation of spoken dialogue services by user error simulations , 2006, INTERSPEECH.
[16] Jason Weston,et al. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.
[17] Peter W. Foltz,et al. The Measurement of Textual Coherence with Latent Semantic Analysis , 1998.
[18] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[19] Philipp Koehn,et al. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation , 2010, WMT@ACL.
[20] Alan Ritter,et al. Unsupervised Modeling of Twitter Conversations , 2010, NAACL.
[21] Aoife Cahill. Correlating Human and Automatic Evaluation of a German Surface Realiser , 2009, ACL/IJCNLP.
[22] C. Kamm,et al. User Interfaces for voice applications , 1994.
[23] Kallirroi Georgila,et al. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems , 2005, SIGDIAL.
[24] Philipp Koehn,et al. Findings of the 2014 Workshop on Statistical Machine Translation , 2014, WMT@ACL.
[25] Vasile Rus,et al. A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics , 2012, BEA@NAACL-HLT.
[26] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[27] Mitchell P. Marcus. Proceedings of the second international conference on Human Language Technology Research , 2002.
[28] Joelle Pineau,et al. Bootstrapping Dialog Systems with Word Embeddings , 2014.
[29] Kevin Gimpel,et al. Towards Universal Paraphrastic Sentence Embeddings , 2015, ICLR.
[30] Alex Graves,et al. Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.
[31] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[32] Joelle Pineau,et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models , 2015, AAAI.
[33] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.
[34] Jacob Cohen,et al. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit , 1968.
[35] Florence Reeder,et al. Corpus-based comprehensive and diagnostic MT evaluation: initial Arabic, Chinese, French, and Spanish results , 2002.
[36] Philipp Koehn,et al. Findings of the 2011 Workshop on Statistical Machine Translation , 2011, WMT@EMNLP.
[37] Colin Cherry,et al. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU , 2014, WMT@ACL.
[38] T. Landauer,et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge , 1997.
[39] Quoc V. Le,et al. A Neural Conversational Model , 2015, ArXiv.
[40] Philipp Koehn,et al. Re-evaluating the Role of Bleu in Machine Translation Research , 2006, EACL.
[41] Jianfeng Gao,et al. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses , 2015, NAACL.
[42] Michael White,et al. Further Meta-Evaluation of Broad-Coverage Surface Realization , 2010, EMNLP.
[43] Alan Ritter,et al. Data-Driven Response Generation in Social Media , 2011, EMNLP.
[44] Matthew Marge,et al. Evaluating Evaluation Methods for Generation in the Presence of Variation , 2005, CICLing.
[45] Joelle Pineau,et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems , 2015, SIGDIAL Conference.
[46] Mirella Lapata,et al. Vector-based Models of Semantic Composition , 2008, ACL.