Improving the PoS tagging accuracy of Icelandic text

Previous work on part-of-speech (PoS) tagging of Icelandic has shown that the morphological complexity of the language poses considerable difficulties for PoS taggers. In this paper, we improve the tagging accuracy for Icelandic text using two methods. First, we present a new tagger, built by integrating an HMM tagger into a linguistic rule-based tagger. Our tagger obtains a state-of-the-art tagging accuracy of 92.31% on the standard test set derived from the IFD corpus, and 92.51% on a corrected version of the corpus. Second, we design an external tagset by removing from the internal tagset the information that reflects distinctions that are not morphologically based. When the external tagset is used for evaluation, the tagging accuracy increases further to 93.63%.
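As a rough illustration of evaluating with an external tagset, the sketch below computes per-token accuracy twice: once on the internal tags as produced, and once after both the gold and the predicted tags are mapped to a coarser external tagset. The tag names (`N`, `V`, `X1`, `X2`, `X`), the `MERGES` table, and the function names are invented for this example; they are not tags from the IFD corpus and not the mapping designed in the paper.

```python
from typing import Callable, Sequence

# Hypothetical merge table: two internal tags that an external tagset
# might treat as one. These are toy tags, not real IFD tags.
MERGES = {"X1": "X", "X2": "X"}


def to_external(tag: str) -> str:
    """Map an internal tag to its external tag (identity if not merged)."""
    return MERGES.get(tag, tag)


def accuracy(
    gold: Sequence[str],
    predicted: Sequence[str],
    mapping: Callable[[str], str] = lambda tag: tag,
) -> float:
    """Fraction of tokens whose mapped predicted tag equals the mapped gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(mapping(g) == mapping(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data: the only error confuses X1 with X2, a distinction the
# hypothetical external tagset does not keep.
gold_tags = ["N", "X1", "V", "X2"]
predicted_tags = ["N", "X2", "V", "X2"]

internal_accuracy = accuracy(gold_tags, predicted_tags)
external_accuracy = accuracy(gold_tags, predicted_tags, to_external)
```

Because the mapping is applied only at evaluation time, the tagger itself can keep working with the finer internal tagset. Accuracy under the external tagset can never be lower than under the internal one, since any pair of tags that matched before the mapping still matches after it.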
