Human Evaluation of a German Surface Realisation Ranker

In this paper we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context affects the choice. We show that native speakers accept considerable variation in word order, but that there are also clear factors that make certain realisation alternatives more natural.
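The abstract refers to ranking realisation alternatives with a language model and a log-linear model. As a minimal sketch of what log-linear ranking over candidate word orders looks like, the Python snippet below scores each candidate as a weighted sum of features and sorts the candidates; the feature names, values, and weights are invented for illustration and are not taken from the paper or its system.

```python
# Minimal sketch (not the authors' system): log-linear ranking of
# surface realisation candidates. All features and weights are
# hypothetical, chosen only to illustrate the scoring mechanism.

def log_linear_score(features, weights):
    """Score a candidate as the weighted sum of its feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rank_candidates(candidates, weights):
    """Return candidates sorted from best to worst by log-linear score."""
    return sorted(
        candidates,
        key=lambda c: log_linear_score(c["features"], weights),
        reverse=True,
    )

if __name__ == "__main__":
    # Two word-order variants of the same German sentence (illustrative).
    candidates = [
        {"string": "Gestern hat der Mann das Buch gelesen.",
         "features": {"lm_logprob": -21.3, "subject_before_object": 1.0}},
        {"string": "Das Buch hat der Mann gestern gelesen.",
         "features": {"lm_logprob": -23.1, "subject_before_object": 0.0}},
    ]
    weights = {"lm_logprob": 1.0, "subject_before_object": 0.5}  # assumed weights
    for cand in rank_candidates(candidates, weights):
        print(round(log_linear_score(cand["features"], weights), 2), cand["string"])
```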
