Paper Information - Towards Learning Morphology for Under-Resourced Fusional and Agglutinating Languages

In this paper, we describe a novel and effective approach for automatically decomposing a word into a stem and suffixes. Russian and Turkish serve as exemplars of fusional and agglutinating languages, respectively. Rather than relying on corpus counts, we use a small number of word pairs as training data, which makes the approach particularly well suited to under-resourced languages. For fusional languages, we first learn a tree of aligned suffix rules (TASR) from word pairs. The tree is built top-down, from general to specific rules, using suffix rule frequency and rule subsumption, and is executed bottom-up, i.e., the most specific rule that fires is chosen. TASR is used to segment a word form into a stem and a suffix sequence. For fusional languages, learning through generation (using TASR) is essential for proper stem extraction. Subsequently, an unsupervised algorithm, graph-based unsupervised suffix segmentation (GBUSS), is used to segment the suffix sequence. GBUSS employs a suffix graph in which node merging, guided by an information-theoretic measure, generates suffix sequences. The approach, experimentally validated on Russian, is shown to be highly effective. For agglutinating languages, only GBUSS is needed for word decomposition. Promising experimental results are obtained for Turkish.
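To make the idea of learning suffix rules from word pairs and letting the most specific rule fire more concrete, here is a minimal Python sketch. It is not the paper's TASR algorithm: it builds no tree and does not model rule subsumption. It assumes that a rule can be read off a word pair by stripping the longest common prefix, and that "most specific" can be approximated by the longest matching source suffix, with frequency as a tie-breaker. The function names and the transliterated example words are hypothetical.

```python
from collections import Counter


def suffix_rule(source, target):
    """Align a word pair on its longest common prefix; return the two differing suffixes."""
    i = 0
    while i < min(len(source), len(target)) and source[i] == target[i]:
        i += 1
    return source[i:], target[i:]


def learn_rules(pairs):
    """Count how often each (source suffix, target suffix) rule occurs in the training pairs."""
    return Counter(suffix_rule(s, t) for s, t in pairs)


def split_word(word, rules):
    """Split a word with the most specific rule that fires.

    Assumption: specificity is approximated by the length of the source suffix,
    and ties are broken by rule frequency. Returns (word, "") if no rule fires.
    """
    candidates = [
        (len(src), count, src)
        for (src, _tgt), count in rules.items()
        if src and word.endswith(src)
    ]
    if not candidates:
        return word, ""
    _, _, src = max(candidates)
    return word[: -len(src)], src


# Hypothetical training pairs (transliterated, for illustration only).
pairs = [("kniga", "knigi"), ("lampa", "lampy"), ("knigami", "knigi")]
rules = learn_rules(pairs)

print(split_word("lampami", rules))
print(split_word("lampa", rules))
print(split_word("stol", rules))
```

The sketch only separates a stem from one suffix string. Segmenting that string further into a suffix sequence is the job the abstract assigns to GBUSS, which is not attempted here.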

Peter A. Flach | Bruno Golénia | Kseniya B. Shalonova
