IceNLP: a natural language processing toolkit for Icelandic

Icelandic is a morphologically complex language for which language technology resources are scarce. Only a few years ago, language technology was practically non-existent in Iceland. In this paper, we describe the development of an NLP toolkit for processing the language, the challenges faced, and the decisions made during development. The current version of the toolkit consists of a tokeniser/sentence segmentiser, a morphological analyser, a linguistic rule-based tagger, and a finite-state parser. The development of our toolkit is a step towards building a Basic Language Resource Kit (BLARK) for the Icelandic language.
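As an illustration of how the four components named above could be chained, the following is a minimal sketch in Python. It is not IceNLP's code or API: the function names, the toy lexicon, the coarse tags, and the single disambiguation rule are all assumptions made for this example, and the "parser" stage is reduced to flat phrase labelling.

```python
import re

# Hypothetical toy lexicon: word form -> candidate coarse tags.
# This is NOT the IceNLP lexicon or tagset; it exists only for the example.
LEXICON = {
    "hún": {"PRON"},
    "á": {"VERB", "PREP", "NOUN"},  # an ambiguous word form
    "hest": {"NOUN"},
    ".": {"PUNCT"},
}

# Hypothetical mapping from coarse tags to flat phrase labels.
PHRASE_LABEL = {"PRON": "NP", "NOUN": "NP", "VERB": "VP"}


def tokenise(text):
    """Split text into sentences, then split each sentence into tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]


def analyse(tokens):
    """Attach all candidate tags; unknown words get an open-class guess."""
    return [(t, set(LEXICON.get(t.lower(), {"NOUN", "VERB", "ADJ"}))) for t in tokens]


def disambiguate(analysed):
    """Apply one hand-written contextual rule, then pick a tag deterministically."""
    tagged, prev = [], None
    for token, tags in analysed:
        # Assumed rule: directly after a pronoun, keep only the verb reading.
        if prev == "PRON" and "VERB" in tags:
            tags = {"VERB"}
        tag = sorted(tags)[0]  # alphabetical fallback when still ambiguous
        tagged.append((token, tag))
        prev = tag
    return tagged


def chunk(tagged):
    """Stand-in for the parser: label each token with a flat phrase type."""
    return [(PHRASE_LABEL.get(tag, tag), token) for token, tag in tagged]


def process(text):
    """Run tokeniser -> analyser -> tagger -> chunker on every sentence."""
    return [chunk(disambiguate(analyse(sent))) for sent in tokenise(text)]


if __name__ == "__main__":
    for sentence in process("Hún á hest."):
        print(sentence)
```

Each stage consumes only the output of the previous one, so a component can be replaced (for example, swapping the rule-based tagger for a statistical one) without touching the others.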
