Paper Information - Linguistic Indeterminacy as a Source of Errors in Tagging

Linguistic Indeterminacy as a Source of Errors in Tagging

Most evaluations of part-of-speech tagging compare the output of an automatic tagger to some established standard, define the differences as tagging errors and try to remedy them by, e.g., more training of the tagger. The present article is based on a manual analysis of a large number of tagging errors. Some clear patterns among the errors can be discerned, and the sources of the errors as well as possible alternative methods of remedy are presented and discussed. In particular are the problems with undecidable

G. Kallgren

[1] Steven J. DeRose, et al. Grammatical Category Disambiguation by Statistical Optimization, 1988, CL.

[2] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1988, ANLP.

[3] Fred Karlsson. A Comprehensive Morphological Analyzer for Swedish, 1990.

[4] Gunnel Kallgren. “The first million is hardest to get”: Building a Large Tagged Corpus as Automatically as Possible, 1990, COLING.

[5] Penelope Sibun, et al. A Practical Part-of-Speech Tagger, 1992, ANLP.

[6] Fred Karlsson, et al. SWETWOL: A Comprehensive Morphological Analyser for Swedish, 1992.

[7] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[8] Douglas R. Cutting. Porting a Stochastic Part-of-Speech Tagger to Swedish, 1993, NODALIDA.

[9] Atro Voutilainen, et al. Specifying a shallow grammatical representation for parsing purposes, 1995, EACL.