PSG hybrid approach to automatic corpus annotation

This paper describes and evaluates a hybrid non-probabilistic parsing method for the grammatical annotation of large corpora and the live analysis of teaching sentences. The method employs a layered scheme of lexicon- and context-based Constraint Grammars (CG) on the one hand, and Phrase Structure Grammars (PSG) or syntactic bracketing algorithms on the other. It has been fully implemented by the author for Danish and Portuguese and, to a certain degree, for Spanish. Add-on modules were also produced for existing English and French taggers. On running newspaper text, overall correctness rates (F-scores) for the two most mature systems approach 99% for word class (PoS) tags and 95-96% for syntactic function tags at the shallow CG level. Although CG errors propagate into structural errors, the subsequent constituent tree analysis adds less than 1% new attachment errors when run on manually revised CG input. All modules in combination, without revision, yield 50-75% structurally "legal" syntactic trees.
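
The layered scheme can be pictured with a deliberately small sketch: a lexicon assigns every possible reading to each token, a context rule in the spirit of Constraint Grammar removes readings that do not fit, and a bracketing step groups the disambiguated tags into constituents. Everything in it is an assumption made for illustration (the toy lexicon, the tag names, the single removal rule, and the functions `lexical_readings`, `apply_constraints`, `bracket` and `analyze`); it is not the author's rule set or implementation.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: each word form maps to all of its possible PoS readings.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "old": {"ADJ"},
    "man": {"N", "V"},   # ambiguous between noun and verb
    "sleeps": {"V"},
}


def lexical_readings(tokens: List[str]) -> List[Set[str]]:
    """Lexicon layer: attach every possible reading (unknown words default to N)."""
    return [set(LEXICON.get(tok, {"N"})) for tok in tokens]


def apply_constraints(readings: List[Set[str]]) -> List[Set[str]]:
    """Context layer: one illustrative CG-style rule.

    Remove the V reading if the preceding token is unambiguously DET or ADJ.
    A reading is only removed from an ambiguous token, so no token ever
    loses its last remaining reading.
    """
    result = [set(r) for r in readings]
    for i in range(1, len(result)):
        prev = result[i - 1]
        if len(prev) == 1 and prev <= {"DET", "ADJ"} and len(result[i]) > 1:
            result[i].discard("V")
    return result


def bracket(tokens: List[str], tags: List[str]) -> str:
    """Bracketing layer: group each run of DET/ADJ followed by N into an NP."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tags[j] in {"DET", "ADJ"}:
            j += 1
        if j < len(tokens) and tags[j] == "N":
            out.append("[NP " + " ".join(tokens[i:j + 1]) + "]")
            i = j + 1
        else:
            out.append("[" + tags[i] + " " + tokens[i] + "]")
            i += 1
    return " ".join(out)


def analyze(sentence: str) -> str:
    """Run the three layers in sequence on a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    disambiguated = apply_constraints(lexical_readings(tokens))
    # If ambiguity remains, pick the alphabetically first reading (arbitrary choice).
    tags = [sorted(r)[0] for r in disambiguated]
    return bracket(tokens, tags)


if __name__ == "__main__":
    print(analyze("The old man sleeps"))
```

In this sketch the bracketing step depends entirely on the tags it receives, which mirrors the point made above that errors at the CG level propagate into structural errors.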