Comparing Automatic and Human Evaluation of NLG Systems

We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (>0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies demonstrating that a particular metric correlates well with human judgments in the domain in question.
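
The comparison described above rests on scoring each system's output with an automatic metric and then correlating those scores with human ratings. As a minimal illustrative sketch (not the authors' actual evaluation code), the snippet below computes sentence-level BLEU with NLTK and a Pearson correlation with SciPy; the system names, texts, and human scores are hypothetical placeholders.

```python
# Illustrative sketch only: correlating an automatic metric (BLEU via NLTK)
# with human ratings, in the spirit of the evaluation described above.
# All texts, system names, and scores below are hypothetical placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# Hypothetical reference text and outputs from three NLG systems.
references = [["heavy", "rain", "expected", "by", "evening"]]
system_outputs = {
    "knowledge_based": ["rain", "becoming", "heavy", "by", "evening"],
    "statistical_1":   ["heavy", "rain", "expected", "by", "evening"],
    "statistical_2":   ["rain", "rain", "rain", "by", "evening"],
}
# Hypothetical mean human ratings (e.g. on a 1-5 scale) for the same systems.
human_scores = {"knowledge_based": 4.2, "statistical_1": 4.6, "statistical_2": 2.1}

smooth = SmoothingFunction().method1  # avoid zero BLEU on short sentences
bleu_scores = {
    name: sentence_bleu(references, hyp, smoothing_function=smooth)
    for name, hyp in system_outputs.items()
}

# Pearson correlation between automatic and human scores across systems.
names = sorted(bleu_scores)
r, p = pearsonr([bleu_scores[n] for n in names], [human_scores[n] for n in names])
print(f"BLEU vs. human ratings: r = {r:.2f} (p = {p:.2f})")
```

In a real study the correlation would be computed over many more systems or texts, and repeated for each metric (NIST, BLEU, ROUGE) against each group of human judges.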
