Understanding the Impact of Experiment Design for Evaluating Dialogue System Output
[1] A. Sorace, et al. Magnitude Estimation of Linguistic Acceptability, 1996.
[2] Emiel Krahmer, et al. Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation, 2017, J. Artif. Intell. Res.
[3] Verena Rieser, et al. RankME: Reliable Human Ratings for Natural Language Generation, 2018, NAACL.
[4] Jesse Hoey, et al. Affective Neural Response Generation, 2017, ECIR.
[5] J. Fleiss, et al. Intraclass correlations: uses in assessing rater reliability, 1979, Psychological Bulletin.
[6] Susan T. Fiske, et al. Scientists Making a Difference: One Hundred Eminent Behavioral and Brain Scientists Talk about Their Most Important Contributions, 2016.
[7] Saif Mohammad, et al. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation, 2017, ACL.
[8] Anja Belz, et al. Comparing Rating Scales and Preference Judgements in Language Evaluation, 2010, INLG.
[9] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[10] Osmar R. Zaïane, et al. Augmenting Neural Response Generation with Context-Aware Topical Attention, 2018, Proceedings of the First Workshop on NLP for Conversational AI.
[11] Joelle Pineau, et al. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses, 2017, ACL.
[12] Anja Belz, et al. Discrete vs. Continuous Rating Scales for Language Evaluation in NLP, 2011, ACL.
[13] Rahul Goel, et al. On Evaluating and Comparing Conversational Agents, 2018, ArXiv.