A scheme for comparative evaluation of diverse parsing systems

We argue that the current dominant paradigm in parser evaluation work, which combines the Penn Treebank reference corpus with the Parseval scoring metrics, is not well suited to the task of general comparative evaluation of diverse parsing systems. We propose an alternative approach with two key components. Firstly, we propose parsed corpora for testing that are much flatter than those currently used, whose “gold standard” parses encode only those grammatical constituents upon which there is broad agreement across a range of grammatical theories. Secondly, we propose modified evaluation metrics that require parser outputs to be “faithful to”, rather than to mimic, the broadly agreed structure encoded in the flatter gold standard analyses.
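
The abstract does not define the faithfulness metric itself, but a minimal sketch may clarify the intuition. One plausible reading, assumed here for illustration, is that a parser's output is faithful to the flat gold standard if none of its constituents crosses a gold constituent; extra nested structure beyond the gold brackets is permitted, whereas exact structural mimicry is not required. The function names and span encoding below are hypothetical, not taken from the paper.

```python
def crosses(a, b):
    """True if word-index spans a=(i, j) and b=(k, l) overlap without nesting."""
    (i, j), (k, l) = a, b
    return (i < k < j < l) or (k < i < l < j)


def faithful(parser_spans, gold_spans):
    """Assumed notion of faithfulness: no parser constituent crosses a gold one.

    Deeper analyses that nest inside (or around) the flat gold constituents
    still count as faithful; only crossing brackets violate the gold standard.
    """
    return not any(crosses(p, g) for p in parser_spans for g in gold_spans)


# Hypothetical example over a five-word sentence (spans are half-open word indices).
gold = [(0, 3), (3, 5)]                  # flat gold standard: two agreed constituents
deep = [(0, 3), (1, 3), (3, 5), (0, 5)]  # a deeper parse adding nested structure
bad = [(0, 2), (2, 4), (0, 5)]           # (2, 4) crosses the gold span (0, 3)

print(faithful(deep, gold))  # True  -- extra nesting is consistent with the gold
print(faithful(bad, gold))   # False -- a crossing bracket breaks faithfulness
```

Under this reading, Parseval-style exact bracket matching would penalise the `deep` analysis for its extra constituents, whereas a faithfulness-based metric would not, which is the contrast the abstract draws between mimicking and being faithful to the gold standard.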