Paper Information - Achieving an Almost Correct PoS-Tagged Corpus

Achieving an Almost Correct PoS-Tagged Corpus

After a theoretical discussion of the representativity of a corpus, this paper presents a simple yet very efficient technique for the (semi-)automatic detection of those positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and applying "invalid bigrams", i.e. on searching for pairs of adjacent tags that constitute an incorrect configuration in a text of a particular language (in English, e.g., the bigram ARTICLE - VERB). The paper further describes the generalization of "invalid bigrams" to "extended invalid bigrams of length n", for any natural number n, which provides a powerful tool for error detection in a corpus. The approach is illustrated with English, German and Czech examples.
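The basic invalid-bigram idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small, trusted sample of tag sequences and treats every tag pair never seen adjacent in that sample as a candidate invalid bigram. The function names, the toy tagset, and the sample data are all hypothetical, and the extended bigrams of length n are not covered.

```python
from itertools import product


def learn_invalid_bigrams(trusted_sentences, tagset):
    """Return all tag pairs over `tagset` never seen adjacent in the trusted data.

    Assumption: the trusted sample is representative enough that an
    unattested pair is a reasonable candidate for an invalid bigram.
    """
    seen = set()
    for tags in trusted_sentences:
        seen.update(zip(tags, tags[1:]))
    return set(product(tagset, repeat=2)) - seen


def flag_suspects(sentence_tags, invalid):
    """Return each index i where (tag[i], tag[i + 1]) is a candidate invalid bigram."""
    pairs = zip(sentence_tags, sentence_tags[1:])
    return [i for i, pair in enumerate(pairs) if pair in invalid]


# Hypothetical toy data: tag sequences only, words omitted.
trusted = [
    ["ARTICLE", "NOUN", "VERB"],
    ["ARTICLE", "ADJ", "NOUN", "VERB", "ARTICLE", "NOUN"],
]
tagset = {"ARTICLE", "ADJ", "NOUN", "VERB"}

invalid = learn_invalid_bigrams(trusted, tagset)
suspects = flag_suspects(["ARTICLE", "VERB", "ARTICLE", "NOUN"], invalid)
print(suspects)  # positions to hand to a human annotator for review
```

With so little trusted data, many legitimate pairs would also be flagged. This is one way to see why the representativity of the learning corpus matters for the approach.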

Karel Oliva | Pavel Kveton

[1] Kenji Ono, et al. Automatic Refinement of a POS Tagger Using a Reliable Parser and Plain Text Corpora, 2000, COLING.

[2] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[3] Karel Oliva, et al. The Possibilities of Automatic Detection/Correction of Errors in Tagged Corpora: A Pilot Study on a German Corpus, 2001, TSD.

[4] Wojciech Skut, et al. An Annotation Scheme for Free Word Order Languages, 1997, ANLP.