PSG hybrid approach to automatic corpus annotation

This paper describes and evaluates a hybrid non-probabilistic parsing method for the grammatical annotation of large corpora and the live analysis of teaching sentences, employing a layered scheme of lexicon- and context-based Constraint Grammars on the one hand, and Phrase Structure Grammars or syntactic bracketing algorithms on the other. The method has been fully implemented by the author for Danish and Portuguese and, to a certain degree, Spanish. Add-on modules were also produced for existing English and French taggers. On running newspaper text, overall correctness rates (F-scores) for the two most mature systems approach 99% for word class (PoS) and 95-96% for syntactic function tags at the shallow CG level. Though it propagates CG errors into structural errors, the subsequent constituent tree analysis adds under 1% of new attachment errors on manually revised CG input. All modules in combination, without revision, generate 50-75% structurally "legal" syntactic trees.

E. Bick

[1] Atro Voutilainen, et al. A language-independent system for parsing unrestricted text, 1995.

[2] Timo Järvinen, et al. A non-projective dependency parser, 1997, ANLP.

[3] Wojciech Skut, et al. Tagging Grammatical Functions, 1997, EMNLP.

[4] Thorsten Brants, et al. Cascaded Markov Models, 1999, EACL.

[5] Eckhard Bick, et al. Providing Internet Access to Portuguese Corpora: the AC/DC Project, 2000, LREC.

[6] Eckhard Bick. En Constraint Grammar Tagger/Parser for Dansk, 2001.

[7] Torbjörn Lager. Transformation-Based Learning of Rules for Constraint Grammar Tagging, 2001, NODALIDA.

[8] S. Buchholz, et al. Memory-Based Grammatical Relation Finding, 2002.

[9] Eckhard Bick, et al. Floresta Sintá(c)tica: A treebank for Portuguese, 2002, LREC.