Paper Information - Session 3: Human Language Evaluation

* Cross-system evaluation: This is a mainstay of the periodic ARPA evaluations on competing systems. Multiple sites agree to run their respective systems on a single application, so that results across systems are comparable. This includes evaluations such as message understanding (MUC)[6], information retrieval (TREC)[7], spoken language systems (ATIS)[8], and automated speech recognition (CSR)[8].

Lynette Hirschman

[1] Chris Brew, et al. Automatic Evaluation of Computer Generated Text: A Progress Report on the TextEval Project, 1994, HLT.

[2] Karen Spärck Jones. Towards Better NLP System Evaluation, 1994, HLT.

[3] Jonathan G. Fiscus, et al. Benchmark Tests for the DARPA Spoken Language Program, 1993, HLT.

[4] Donna K. Harman, et al. Overview of the Second Text REtrieval Conference (TREC-2), 1994, HLT.

[5] Robert C. Moore. Semantic Evaluation for Spoken-Language Systems, 1994, HLT.

[6] Jonathan G. Fiscus, et al. 1993 Benchmark Tests for the ARPA Spoken Language Program, 1994, HLT.

[7] Ralph Grishman, et al. A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars, 1991, HLT.

[8] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.

[9] John S. White, et al. Evaluation in the ARPA Machine Translation Program: 1993 Methodology, 1994, HLT.

[10] Ralph Grishman. Whither Written Language Evaluation?, 1994, HLT.

[11] Beth Sundheim, et al. Survey of the Message Understanding Conferences, 1993, HLT.