NLTK tagger for Albanian using iterative approach

This paper presents research on a model for tagging Albanian texts using the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32,000 words together with their corresponding POS tags, as well as a set of regular-expression rules. A lemmatization module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary. This tagger serves as the baseline for a regular-expression tagger. Words that were lemmatized incorrectly are then corrected, which yields a third, lookup tagger. This third tagger is used with the first and second taggers as backoff.
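The backoff idea in the abstract can be illustrated with a minimal, standard-library sketch. It does not use NLTK's own classes (NLTK provides `UnigramTagger` and `RegexpTagger`, both of which accept a `backoff` argument), and it is not the author's implementation. The helper names, the tag labels, the toy word lists, the two regular-expression rules, and the order in which the three taggers are consulted are all assumptions made for illustration.

```python
import re

def make_lookup_tagger(table):
    """Return a tagger that maps a word to its tag, or None if the word is unknown."""
    def tag(word):
        return table.get(word.lower())
    return tag

def make_regexp_tagger(rules):
    """Return a tagger that applies (pattern, tag) rules in order; None if no rule matches."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in rules]
    def tag(word):
        for pattern, label in compiled:
            if pattern.fullmatch(word):
                return label
        return None
    return tag

def cascade(taggers, default=None):
    """Chain taggers: each word goes to the first tagger that returns a tag."""
    def tag_tokens(tokens):
        result = []
        for word in tokens:
            chosen = default
            for tagger in taggers:
                label = tagger(word)
                if label is not None:
                    chosen = label
                    break  # later taggers act only as backoff
            result.append((word, chosen))
        return result
    return tag_tokens

# Toy data, hypothetical: the real dictionary has around 32,000 entries.
dictionary = {"dhe": "CONJ", "libër": "N"}
rules = [(r"\d+", "NUM"), (r".+im", "N")]   # illustrative rules only
corrections = {"shkoj": "V"}                 # stands in for the third, lookup tagger

# Assumed order: corrections first, then the dictionary, then the regex rules.
tagger = cascade([
    make_lookup_tagger(corrections),
    make_lookup_tagger(dictionary),
    make_regexp_tagger(rules),
])

tagged = tagger(["libër", "dhe", "mësim", "32000", "xyz"])
print(tagged)
```

Words that no tagger recognizes keep the `default` value (here `None`), which makes it easy to see which tokens still need dictionary entries or rules.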

Arbana Kadriu
