Modeling and learning multilingual inflectional morphology in a minimally supervised framework

Computational morphology is an important component of most natural language processing tasks, including machine translation, information retrieval, word sense disambiguation, parsing, and text generation. Morphological analysis, the process of finding the root form and part of speech of an inflected word form, and its inverse, morphological generation, can provide fine-grained part-of-speech information and help resolve necessary syntactic agreements. In addition, morphological analysis can reduce the problem of data sparseness through dimensionality reduction. This thesis presents a successful original paradigm for both morphological analysis and generation, treating both tasks in a competitive linkage model based on a combination of diverse inflection-root similarity measures. Previous approaches to the machine learning of morphology have been essentially limited to string-based transduction models. In contrast, the work presented here integrates several new noise-robust, trie-based supervised methods for learning these transductions with a suite of unsupervised alignment models based on weighted Levenshtein distance, position-weighted contextual similarity, and several models of distributional similarity, including expected relative frequency. Via iterative bootstrapping, the combination of these models yields a full lemmatization analysis competitive with fully supervised approaches but without any direct supervision. This thesis also presents an original translingual projection model for morphology induction, in which previously learned morphological analyses in a second language can be robustly projected via bilingual corpora to yield successful analyses in the new target language without any monolingual supervision. Collectively these methods outperform previously published algorithms for
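As an illustration of one of the similarity measures named above, the following is a minimal sketch of a weighted Levenshtein distance used to rank candidate roots for an inflected form. The cost values (vowel-for-vowel substitutions at 0.5, all other edits at 1.0) and the helper names `default_sub_cost`, `weighted_levenshtein`, and `best_root` are assumptions made for this example; they are not taken from the thesis, which combines this measure with several others rather than using it alone.

```python
# Hypothetical sketch: weighted edit distance between an inflection and a root.
# The cost table below is an illustrative assumption, not the thesis's weights.

VOWELS = set("aeiou")


def default_sub_cost(a, b):
    """Substitution cost: free for identical characters, cheap for vowel-to-vowel changes."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # assumed: stem vowel changes (sang/sing) are common in inflection
    return 1.0


def weighted_levenshtein(source, target, sub_cost=default_sub_cost, indel_cost=1.0):
    """Dynamic-programming edit distance with pair-dependent substitution costs."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column and row: cost of deleting or inserting every character.
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # delete from source
                dist[i][j - 1] + indel_cost,  # insert into source
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


def best_root(inflection, candidate_roots):
    """Return the candidate root with the lowest weighted distance to the inflection."""
    return min(candidate_roots, key=lambda root: weighted_levenshtein(inflection, root))
```

With these assumed weights, `best_root("sang", ["sing", "sink", "send"])` selects `"sing"`, because a single vowel change costs less than a vowel change plus a consonant substitution.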
