Towards Learning Morphology for Under-Resourced Fusional and Agglutinating Languages

In this paper, we describe a novel and effective approach for automatically decomposing a word into a stem and suffixes. Russian and Turkish are used as exemplars of fusional and agglutinating languages, respectively. Rather than relying on corpus counts, we use a small number of word-pairs as training data, which makes the approach particularly suited to under-resourced languages. For fusional languages, we first learn a tree of aligned suffix rules (TASR) from word-pairs. The tree is built top-down, from general to specific rules, using suffix rule frequency and rule subsumption, and is executed bottom-up, i.e., the most specific rule that fires is chosen. TASR is used to segment a word form into a stem and a suffix sequence. For fusional languages, learning through generation (using TASR) is essential for proper stem extraction. Subsequently, an unsupervised algorithm, graph-based unsupervised suffix segmentation (GBUSS), is used to segment the suffix sequence. GBUSS employs a suffix graph in which node merging, guided by an information-theoretic measure, generates suffix sequences. The approach, experimentally validated on Russian, is shown to be highly effective. For agglutinating languages, only GBUSS is needed for word decomposition. Promising experimental results are obtained for Turkish.
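The "most specific rule that fires" behaviour of TASR can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the tree with a flat rule list and treats the rule with the longest matching source suffix as the most specific one. The names SuffixRule, most_specific_rule and apply_rule, and the toy English rules, are assumptions made for illustration only.

```python
from typing import List, NamedTuple, Optional


class SuffixRule(NamedTuple):
    """Hypothetical rule format: rewrite the word-final `src` as `dst`."""
    src: str
    dst: str


def most_specific_rule(word: str, rules: List[SuffixRule]) -> Optional[SuffixRule]:
    """Return the firing rule with the longest source suffix, or None.

    Assumption: a longer matching suffix stands in for "more specific",
    which mimics bottom-up execution without building an actual tree.
    """
    matching = [r for r in rules if word.endswith(r.src)]
    if not matching:
        return None
    return max(matching, key=lambda r: len(r.src))


def apply_rule(word: str, rule: SuffixRule) -> str:
    """Strip the rule's source suffix and attach its target suffix."""
    stem = word[: len(word) - len(rule.src)]
    return stem + rule.dst


# Toy rules, ordered from general to specific (illustrative, not learned).
RULES = [
    SuffixRule("", ""),      # default: leave the word unchanged
    SuffixRule("s", ""),     # cats -> cat
    SuffixRule("ies", "y"),  # parties -> party
]

for w in ["parties", "cats", "sheep"]:
    rule = most_specific_rule(w, RULES)
    print(w, "->", apply_rule(w, rule))
```

The word "parties" matches all three toy rules, but only the most specific one ("ies" to "y") is applied, so the more general "s" rule acts as a fallback rather than a competitor.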
