Developing a PoS-tagged corpus using existing tools

In this paper, we describe the development of a new tagged corpus of Icelandic, consisting of about 1 million tokens. The goal is to use the corpus, among other things, as a new gold standard for training and testing PoS taggers. We describe the individual phases of the corpus construction: text selection and cleaning, sentence segmentation and tokenisation, PoS tagging with a combination method, error detection, and error correction. Furthermore, we discuss the problems that have emerged, highlight the software tools that have proved useful, and identify which of them are re-usable across languages. Our preliminary evaluation results show that the error detection programs are effective and that our tagger combination method is crucial for limiting the amount of hand-correction that must be carried out in future work. We believe that our work will be of help to those wishing to develop similar resources for less-resourced languages.
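
As an illustration of what tagging "with a combination method" can look like, the following is a minimal sketch of simple majority voting over the outputs of several taggers. It is an assumption made for illustration only: the abstract does not specify which combination scheme is used, and the function name `combine_tags`, the toy tags, and the tie-breaking rule (prefer the tag proposed by the earliest-listed tagger) are hypothetical.

```python
from collections import Counter


def combine_tags(tagger_outputs):
    """Combine per-token tags from several taggers by simple majority vote.

    tagger_outputs: one tag sequence per tagger, all over the same tokens,
    listed in order of preference (assumption: the earliest-listed tagger
    wins when several tags receive the same number of votes).
    """
    lengths = {len(tags) for tags in tagger_outputs}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same token sequence")

    combined = []
    for votes in zip(*tagger_outputs):
        counts = Counter(votes)
        best = max(counts.values())
        # First tag, in tagger order, that reaches the highest vote count.
        combined.append(next(tag for tag in votes if counts[tag] == best))
    return combined


# Toy example with three hypothetical taggers and a made-up coarse tagset.
outputs = [
    ["DET", "N", "V"],
    ["DET", "V", "V"],
    ["DET", "N", "N"],
]
print(combine_tags(outputs))
```

Tokens on which the taggers disagree are natural candidates for manual inspection, which is one reason a combination step can affect how much hand-correction remains.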
