Robust ending guessing rules with application to Slavonic languages

The paper studies the automatic extraction of diagnostic word endings for Slavonic languages, with the aim of determining grammatical, morphological and semantic properties of the underlying word. In particular, ending guessing rules are learned from a large morphological dictionary of Bulgarian in order to predict POS, gender, number, article and semantics. A simple, exact, high-accuracy algorithm is developed and compared to an approximate one, which uses a scoring function previously proposed by Mikheev for POS guessing. It is shown how the number of rules of the latter can be reduced by a factor of up to 35 without sacrificing performance. The evaluation demonstrates coverage close to 100% and precision of 97-99% for the approximate algorithm.
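To make the idea of an ending guessing rule concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes that "exact" means an ending becomes a rule only when every dictionary word carrying it has the same tag, and it does not reproduce Mikheev's scoring function. The names `learn_exact_ending_rules` and `guess`, the parameters `max_len` and `min_support`, and the toy English lexicon are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Toy lexicon of (word, tag) pairs; illustrative data, not from the paper.
LEXICON = [
    ("walking", "VERB"),
    ("talking", "VERB"),
    ("king", "NOUN"),
    ("nation", "NOUN"),
    ("station", "NOUN"),
]

def learn_exact_ending_rules(lexicon, max_len=4, min_support=2):
    """Keep an ending as a rule only if all lexicon words with it share one tag.

    Assumption: this "unanimous ending" criterion stands in for the exact
    algorithm mentioned in the abstract; the paper may define it differently.
    """
    tags_by_ending = defaultdict(set)
    support = defaultdict(int)
    for word, tag in lexicon:
        # Consider endings of length 1..max_len, always leaving at least
        # one character of the word outside the ending.
        for k in range(1, min(max_len, len(word) - 1) + 1):
            ending = word[-k:]
            tags_by_ending[ending].add(tag)
            support[ending] += 1
    return {
        ending: next(iter(tags))
        for ending, tags in tags_by_ending.items()
        if len(tags) == 1 and support[ending] >= min_support
    }

def guess(word, rules, max_len=4):
    """Return the tag of the longest matching ending, or None if none matches."""
    for k in range(min(max_len, len(word) - 1), 0, -1):
        tag = rules.get(word[-k:])
        if tag is not None:
            return tag
    return None

rules = learn_exact_ending_rules(LEXICON)
print(guess("motion", rules))  # matched by the ending "tion"
print(guess("sing", rules))    # "ing" is ambiguous in the toy lexicon, so no rule applies
```

In this sketch the ambiguous ending "ing" (seen on both verbs and a noun) is discarded, which trades coverage for precision. An approximate learner would instead keep such endings when a confidence score is high enough.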

[1] Mary P. Harper, et al. Analysis of Unknown Lexical Items using Morphological and Syntactic Information with the TIMIT Corpus, 1997, VLC.

[2] Eric Gaussier, et al. Unsupervised learning of derivational morphology from inflectional lexicons, 1999.

[3] Walter Daelemans, et al. Memory-Based Morphological Analysis, 1999, ACL.

[4] Helmut Schmid, et al. Improvements in Part-of-Speech Tagging with an Application to German, 1999.

[5] David Yarowsky, et al. Language Independent, Minimally Supervised Induction of Lexical Probabilities, 2000, ACL.

[6] David Yarowsky, et al. Minimally Supervised Morphological Analysis by Multimodal Alignment, 2000, ACL.

[7] E. Paskaleva. Compilation and validation of morphological resources (overview of the morphology cooking technologies), 2003.

[8] Christian Jacquemin, et al. Guessing morphology from terms and corpora, 1997, SIGIR '97.

[9] Robert H. Baud, et al. Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models, 2000, CoNLL/LLL.

[10] Richard M. Schwartz, et al. Coping with Ambiguity and Unknown Words through Probabilistic Models, 1993, CL.

[11] Hervé Déjean. Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora, 1998, CoNLL.

[12] Jan Daciuk, et al. Treatment of Unknown Words, 1999, WIA.

[13] Andrei Mikheev, et al. Automatic Rule Induction for Unknown-Word Guessing, 1997, CL.

[14] Daniel Jurafsky, et al. Knowledge-Free Induction of Morphology Using Latent Semantic Analysis, 2000, CoNLL/LLL.

[15] John Goldsmith, et al. Automatic Collection and Analysis of German Compounds, 1998.

[16] Stephen F. Weiss, et al. Word segmentation by letter successor varieties, 1974, Inf. Storage Retr.

[17] Frank Keller, et al. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks, 2004, NAACL.

[18] Eric Brill, et al. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging, 1995, VLC@ACL.

[19] Preslav Nakov. BulStem: Design and Evaluation of Inflectional Stemmer for Bulgarian, 1998.

[20] Preslav Nakov, et al. Guessing morphological classes of unknown German nouns, 2003, RANLP.

[21] Julian M. Kupiec, et al. Robust part-of-speech tagging using a hidden Markov model, 1992.

[22] Max Silberztein, et al. Dictionnaires électroniques et analyse automatique de textes : le système intex, 1993.

[23] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[24] Penelope Sibun, et al. A Practical Part-of-Speech Tagger, 1992, ANLP.