DEPARTMENT OF INTELLIGENT SYSTEMS

Automatic lemmatization is a core component of many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial: nouns, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, since their word-forms cannot be matched against a morphological lexicon. This paper discusses a machine learning approach to the automatic lemmatization of unknown words in Slovene texts. We decompose the problem of learning to perform lemmatization into two subproblems: learning to perform morphosyntactic tagging of words in a text, and learning to perform morphological analysis, which produces the lemma from the word-form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. We train the tagger on a manually annotated corpus of 100,000 running words. We train the analyzer on open-class inflecting Slovene words, namely nouns, adjectives, and main verbs, which together are characterized by more than 400 different morphosyntactic tags. The training set for the analyzer is a morphological lexicon containing 15,000 lemmas. We evaluate the learned model on word lists extracted from a corpus of Slovene texts containing 500,000 words, and show that our morphological analysis module achieves 98.6% accuracy, while the combination of the tagger and analyzer achieves 92.0% accuracy on unknown inflecting Slovene words.
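The two-stage decomposition described above can be illustrated with a minimal Python sketch. It is not the system from the paper: the tagger here is a hypothetical lookup table standing in for a trained trigram tagger, and the hand-written suffix rules only mimic the form of an ordered decision list, in which the first matching rule fires. The tags (`Ncfsg`, `Ncfsa`, `Ncfsn`), the rules, and all function names are assumptions chosen for illustration.

```python
from typing import Callable, List, NamedTuple, Sequence


class Rule(NamedTuple):
    tag: str          # morphosyntactic tag the rule applies to
    suffix: str       # ending the word-form must have ("" matches any form)
    replacement: str  # ending that replaces the suffix to give the lemma


# Illustrative, hand-written rules (assumed, not learned from data).
# Order matters: the first rule whose tag and suffix both match is used.
RULES: List[Rule] = [
    Rule("Ncfsg", "e", "a"),  # genitive singular:   mize -> miza
    Rule("Ncfsa", "o", "a"),  # accusative singular: mizo -> miza
    Rule("Ncfsn", "", ""),    # nominative singular is already the lemma
]


def analyze(word_form: str, tag: str, rules: Sequence[Rule] = RULES) -> str:
    """Stage 2: produce the lemma from the word-form and its tag."""
    for rule in rules:
        if rule.tag == tag and word_form.endswith(rule.suffix):
            stem = word_form[: len(word_form) - len(rule.suffix)]
            return stem + rule.replacement
    return word_form  # no rule applies: fall back to the form itself


def toy_tagger(tokens: Sequence[str]) -> List[str]:
    """Stage 1 placeholder: a lookup table, NOT a trigram tagger."""
    lookup = {"mize": "Ncfsg", "mizo": "Ncfsa", "miza": "Ncfsn"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatize(
    tokens: Sequence[str],
    tagger: Callable[[Sequence[str]], List[str]],
) -> List[str]:
    """Tag the tokens first, then analyze each (word-form, tag) pair."""
    tags = tagger(tokens)
    return [analyze(token, tag) for token, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatize(["mize", "mizo", "miza"], toy_tagger))
```

Because the analyzer conditions on the tag, a tagging error can propagate into the lemma: in this sketch, `analyze("mize", "Ncfsn")` returns `mize` unchanged. This is consistent with the combined tagger and analyzer being less accurate than the analyzer given correct tags.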
