Correcting a POS-Tagged Corpus Using Three Complementary Methods

The quality of the part-of-speech (PoS) annotation in a corpus is crucial for the development of PoS taggers. In this paper, we experiment with three complementary methods for automatically detecting errors in the PoS annotation of the Icelandic Frequency Dictionary corpus. The first two methods are language-independent, and we argue that the third method can be adapted to other morphologically complex languages. Once possible errors have been detected, we examine each error candidate and hand-correct the corresponding PoS tag if necessary. Overall, based on the three methods, we hand-correct the PoS tagging of 1,334 tokens (0.23% of all tokens) in the corpus. Furthermore, we re-evaluate existing state-of-the-art PoS taggers on Icelandic text using the corrected corpus.
