
DEPARTMENT OF INTELLIGENT SYSTEMS

Automatic lemmatization is a core component of many language processing tasks. In inflectionally rich languages such as Slovene, assigning the correct lemma (base form) to each word in running text is not trivial: nouns, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, since their word-forms cannot be matched against a morphological lexicon. This paper discusses a machine learning approach to the automatic lemmatization of unknown words in Slovene texts. We decompose the problem of learning to perform lemmatization into two subproblems: learning to perform morphosyntactic tagging of words in a text, and learning to perform morphological analysis, which produces the lemma from the word-form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. We train the tagger on a manually annotated corpus of 100,000 running words. We train the analyzer on open-class inflecting Slovene words, namely nouns, adjectives, and main verbs, which together are characterized by more than 400 different morphosyntactic tags. The training set for the analyzer is a morphological lexicon containing 15,000 lemmas. We evaluate the learned model on word lists extracted from a corpus of Slovene texts containing 500,000 words, and show that our morphological analysis module achieves 98.6% accuracy when given the correct tag, while the combination of the tagger and analyzer is 92.0% accurate on unknown inflecting Slovene words.
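The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the rules below are hand-written rather than learned, the tags merely imitate MULTEXT-East-style morphosyntactic descriptions, and `DECISION_LISTS`, `analyze`, and `lemmatize` are hypothetical names introduced here. The tagger is passed in as an arbitrary function, standing in for the trained trigram tagger.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (word-form suffix, lemma suffix). Rules are tried in order and
# the first match wins, which is what makes this a decision list. An empty
# word-form suffix matches any word and acts as the default rule.
Rule = Tuple[str, str]

# Hypothetical hand-written decision lists, one per morphosyntactic tag.
# A learned system would induce these from a morphological lexicon.
DECISION_LISTS: Dict[str, List[Rule]] = {
    "Ncfsg": [("e", "a"), ("", "")],      # noun, fem. sg. genitive: mize -> miza
    "Ncfsa": [("o", "a"), ("", "")],      # noun, fem. sg. accusative: mizo -> miza
    "Vmip1s": [("am", "ati"), ("", "")],  # verb, present 1st sg.: delam -> delati
}


def analyze(word_form: str, tag: str) -> str:
    """Stage 2: map a word-form to its lemma, given its morphosyntactic tag."""
    rules = DECISION_LISTS.get(tag)
    if rules is None:
        # No rules for this tag (e.g. a non-inflecting word): keep the form.
        return word_form
    for suffix, replacement in rules:
        if word_form.endswith(suffix):
            stem = word_form[: len(word_form) - len(suffix)]
            return stem + replacement
    return word_form


def lemmatize(tokens: List[str],
              tagger: Callable[[List[str]], List[str]]) -> List[str]:
    """Stage 1 (tagging) followed by stage 2 (morphological analysis)."""
    tags = tagger(tokens)
    return [analyze(word, tag) for word, tag in zip(tokens, tags)]
```

Because the analyzer only sees the tag that the tagger predicts, a tagging error can send a word-form to the wrong rule list. That dependency is consistent with the combined accuracy reported above being lower than the accuracy of the analyzer given correct tags.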

Saso Dzeroski | Tomaz Erjavec
