Tagging Icelandic text: A linguistic rule-based approach

The Icelandic language is a morphologically complex language, for which a large tagset has been created. This paper describes the design of a linguistic rulebased system for part-of-speech tagging Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a global heuristics component. The heuristics guess the functional roles of the words in a sentence, mark prepositional phrases, and use the acquired knowledge to force feature agreement where appropriate. IceMorphy is used for guessing the tag profile for unknown words and for automatically filling tag profile gaps in the lexicon. Evaluation shows that IceTagger achieves 91.54% accuracy, a substantial improvement on the highest accuracy, 90.44%, obtained using three state-of-the-art data-driven taggers. Furthermore, the accuracy increases to 92.95% by using IceTagger along with two data-driven taggers in a simple voting scheme. The development time of the tagging system was only 7 man-months, which can be considered a short development period for a linguistic rule-based system.

[1]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[2]  Andrei Mikheev,et al.  Automatic Rule Induction for Unknown-Word Guessing , 1997, CL.

[3]  Jean-Pierre Chanod,et al.  Tagging French - comparing a statistical and a constraint-based method , 1995, EACL.

[4]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[5]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[6]  Thorsten Brants,et al.  TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[7]  Wolfgang Lezius Ims Morphy -- German Morphology, Part-of-Speech Tagging and Applications , 2000 .

[8]  Atro Voutilainen A syntax-based part-of-speech analyser , 1995, EACL.

[9]  Atro Voutilainen,et al.  Comparing a Linguistic and a Stochastic Tagger , 1997, ACL.

[10]  Christer Samuelsson,et al.  Morphological Tagging Based Entirely on Bayesian Inference , 1993, NODALIDA.

[11]  Helmut Schmid,et al.  Improvements in Part-of-Speech Tagging with an Application to German , 1999 .

[12]  Hrafn Loftsson,et al.  Tagging a Morphologically Complex Language Using Heuristics , 2006, FinTAL.

[13]  Hrafn Loftsson,et al.  Tagging Icelandic text: an experiment with integrations and combinations of taggers , 2007, Lang. Resour. Evaluation.

[14]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[15]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[16]  Preslav Nakov,et al.  Guessing morphological classes of unknown German nouns , 2003, RANLP.

[17]  Eiríkur Rögnvaldsson,et al.  Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic , 2007 .

[18]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[19]  Robert F. Simmons,et al.  A Computational Approach to Grammatical Coding of English Words , 1963, JACM.

[20]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[21]  Grace Ngai,et al.  Transformation Based Learning in the Fast Lane , 2001, NAACL.

[22]  Walter Daelemans,et al.  Improving Accuracy in word class tagging through the Combination of Machine Learning Systems , 2001, CL.

[23]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[24]  Atro Voutilainen,et al.  A language-independent system for parsing unrestricted text , 1995 .