Unsupervised Learning of Word Segmentation Rules with Genetic Algorithms and Inductive Logic Programming

This article presents a combination of unsupervised and supervised learning techniques for generating word segmentation rules from a raw list of words. First, a language bias for word segmentation is introduced, and a simple genetic algorithm searches for the segmentation that yields the best bias value. In the second phase, the words segmented by the genetic algorithm serve as input to the first-order decision list learner CLOG. The result is a set of first-order rules that can be used to segment unseen words. When applied to either the training data or unseen data, these rules produce segmentations that are linguistically meaningful and conform, to a large degree, to the annotation provided.
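The first phase can be illustrated with a minimal sketch. The abstract does not define the bias, so the code assumes one: each word is cut once into a prefix and a suffix, and the cost of a segmentation is the total number of characters in the distinct prefixes plus the distinct suffixes (lower is better). The names `cost` and `evolve`, the selection scheme, and all parameter values are hypothetical choices made for illustration, not the article's settings.

```python
import random


def cost(words, cuts):
    """Assumed bias (not from the article): total characters in the
    sets of distinct prefixes and distinct suffixes. Lower is better."""
    prefixes = {w[:c] for w, c in zip(words, cuts)}
    suffixes = {w[c:] for w, c in zip(words, cuts)}
    return sum(map(len, prefixes)) + sum(map(len, suffixes))


def evolve(words, pop_size=30, generations=200, mutation_rate=0.1, seed=0):
    """Simple genetic algorithm over segmentations.

    A chromosome holds one cut position per word (0..len(word)).
    """
    rng = random.Random(seed)

    def random_cuts():
        return [rng.randint(0, len(w)) for w in words]

    population = [random_cuts() for _ in range(pop_size)]
    # Seed one individual with the trivial segmentation (no suffix),
    # so the result is never worse than leaving the words uncut.
    population[0] = [len(w) for w in words]

    for _ in range(generations):
        population.sort(key=lambda cuts: cost(words, cuts))
        # Truncation selection: the better half survives unchanged.
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randint(0, len(words))  # one-point crossover
            child = a[:point] + b[point:]
            # Mutation: occasionally move a word's cut to a random position.
            child = [
                rng.randint(0, len(w)) if rng.random() < mutation_rate else c
                for w, c in zip(words, child)
            ]
            children.append(child)
        population = parents + children

    best = min(population, key=lambda cuts: cost(words, cuts))
    return best, cost(words, best)


if __name__ == "__main__":
    words = ["walked", "walking", "talked", "talking", "jumped", "jumping"]
    cuts, value = evolve(words)
    for w, c in zip(words, cuts):
        print(w[:c] + "+" + w[c:])
    print("cost:", value)
```

In the pipeline the abstract describes, the resulting prefix and suffix pairs would then be passed as training examples to the rule learner. That second phase is not sketched here.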
