Paper Information - Rating Computer-Generated Questions with Mechanical Turk

Rating Computer-Generated Questions with Mechanical Turk

We use Amazon Mechanical Turk to rate computer-generated reading comprehension questions about Wikipedia articles. Such application-specific ratings can be used to train statistical rankers to improve systems' final output, or to evaluate technologies that generate natural language. We discuss the question rating scheme we developed, assess the quality of the ratings that we gathered through Amazon Mechanical Turk, and show evidence that these ratings can be used to improve question generation.

Noah A. Smith | Michael Heilman
