Morphological inference from bitext for resource-poor languages

The development of rich, multilingual corpora is essential for enabling new types of large-scale inquiry into the nature of language (Abney and Bird, 2010; Lewis and Xia, 2010). However, significant digital resources currently exist for only a handful of the world's languages. The present dissertation addresses this issue by introducing new techniques for building rich corpora through automated enrichment of existing resources. As a way of leveraging existing resources, it describes an automated method for extracting bitext (text accompanied by a translation) from bilingual documents. Digitized copies of printed books are mined for foreign-language material, with statistical methods for language identification and word alignment used to identify instances of English-foreign bitext. The English text is then parsed and the resulting analysis is transferred across the word alignments, so that the foreign word tokens are tagged with English glosses and morphosyntactic features. These tagged tokens constitute the input to a new algorithm, presented in this dissertation, for performing morphology induction. Drawing on previous work on unsupervised morphology induction that uses the principle of minimum description length to drive the analysis (Goldsmith, 2001), the algorithm uses a greedy hill-climbing search to minimize the size of a paradigm-based morphological description of the language. It simultaneously segments wordforms into their component morphemes and organizes stems and affixes into a paradigmatic structure. Because tagged tokens are used as input, the morphemes produced by this induction method are paired with meaningful morphosyntactic features, an improvement over algorithms for unsupervised morphology induction based on monolingual text, which treat morphemes purely as strings of letters.
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis.
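The idea of a greedy hill-climbing search that minimizes the size of a morphological description can be illustrated with a toy sketch. The code below is not the dissertation's algorithm: it assumes suffix-only morphology, measures description length in characters rather than bits, and charges one extra unit per (suffix, tag) entry for the feature label. The names `description_length`, `induce`, and `paradigms`, along with the sample tokens, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Token = Tuple[str, str]          # (wordform, morphosyntactic tag)
Analysis = Tuple[str, str, str]  # (stem, suffix, tag)


def description_length(tokens: List[Token], splits: List[int]) -> int:
    """Toy cost: characters in the stem lexicon plus, for each distinct
    (suffix, tag) entry, its characters and one unit for the tag label."""
    stems = set()
    suffixes = set()
    for (word, tag), cut in zip(tokens, splits):
        stems.add(word[:cut])
        suffixes.add((word[cut:], tag))
    stem_cost = sum(len(stem) for stem in stems)
    suffix_cost = sum(len(suffix) + 1 for suffix, _ in suffixes)
    return stem_cost + suffix_cost


def induce(tokens: List[Token]) -> List[Analysis]:
    """Greedy hill climbing: start with every word unsegmented, then
    repeatedly apply the single re-segmentation that lowers the cost
    the most, stopping when no move gives an improvement."""
    splits = [len(word) for word, _ in tokens]
    best = description_length(tokens, splits)
    while True:
        best_move = None
        for i, (word, _) in enumerate(tokens):
            original = splits[i]
            for cut in range(1, len(word) + 1):
                if cut == original:
                    continue
                splits[i] = cut
                cost = description_length(tokens, splits)
                if cost < best:
                    best, best_move = cost, (i, cut)
            splits[i] = original  # undo the trial move
        if best_move is None:
            break
        splits[best_move[0]] = best_move[1]
    return [(w[:c], w[c:], t) for (w, t), c in zip(tokens, splits)]


def paradigms(analyses: List[Analysis]) -> Dict[FrozenSet[Tuple[str, str]], List[str]]:
    """Group stems that share the same set of (suffix, tag) pairs."""
    by_stem = defaultdict(set)
    for stem, suffix, tag in analyses:
        by_stem[stem].add((suffix, tag))
    table = defaultdict(list)
    for stem, affixes in by_stem.items():
        table[frozenset(affixes)].append(stem)
    return {signature: sorted(stems) for signature, stems in table.items()}


if __name__ == "__main__":
    # Hypothetical tagged tokens, standing in for projected annotations.
    tokens = [("walk", "VB"), ("walks", "VBZ"), ("walked", "VBD"),
              ("talk", "VB"), ("talks", "VBZ"), ("talked", "VBD")]
    for signature, stems in paradigms(induce(tokens)).items():
        print(sorted(signature), stems)
```

On these six tokens the search lowers the toy cost from 33 to 14 and places `walk` and `talk` in one paradigm whose suffixes are each paired with a tag. A realistic system would need a bit-based encoding, pointer costs, and moves over whole paradigms rather than single words.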

[1]  Yunheng Ji MORPHOLOGY , 1937, A Grammar of Italian Sign Language (LIS).

[2]  Mikko Kurimo,et al.  Morpho Challenge 2005-2010: Evaluations and Results , 2010, SIGMORPHON.

[3]  Philipp Koehn,et al.  Factored Translation Models , 2007, EMNLP.

[4]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.

[5]  Baden Hughes,et al.  Frontiers in Linguistic Annotation for Lower-Density Languages , 2006 .

[6]  Regina Barzilay,et al.  Unsupervised Multilingual Learning for Morphological Segmentation , 2008, ACL.

[7]  Katrin Erk,et al.  Minimally supervised lemmatization scheme induction through bilingual parallel corpora , 1998 .

[8]  H. Isahara,et al.  Language identification based on string kernels , 2005, IEEE International Symposium on Communications and Information Technology (ISCIT 2005).

[9]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[10]  Fei Xia,et al.  Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages , 2010, Lit. Linguistic Comput..

[11]  Dan Klein,et al.  Unsupervised Learning of Field Segmentation Models for Information Extraction , 2005, ACL.

[12]  Steven Bird,et al.  Towards a general model of interlinear text , 2003 .

[13]  A. Ross Structural Linguistics , 1953, Nature.

[14]  Hal Daumé,et al.  A Bayesian Model for Discovering Typological Implications , 2007, ACL.

[15]  Kemal Oflazer,et al.  Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish , 2010, ACL.

[16]  Mathias Creutz,et al.  Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner , 2007, MTSUMMIT.

[17]  Fei Xia,et al.  Multilingual Structural Projection across Interlinear Text , 2007, HLT-NAACL.

[18]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[19]  Sunita Sarawagi,et al.  Integrating Unstructured Data into Relational Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[20]  Jenny R. Saffran,et al.  Statistical Learning by 8-Month-Old Infants , 1996, Science.

[21]  Tomoharu Iwata,et al.  Learning Common Grammar from Multilingual Corpus , 2010, ACL.

[22]  Jorma Rissanen,et al.  Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[23]  Xiaoyi Ma,et al.  Champollion: A Robust Parallel Text Sentence Aligner , 2006, LREC.

[24]  Maosong Sun,et al.  Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm , 2010, COLING.

[25]  Fei Xia,et al.  Parsing, Projecting & Prototypes: Repurposing Linguistic Data on the Web , 2009, EACL.

[26]  Noah A. Smith,et al.  The Web as a Parallel Corpus , 2003, CL.

[27]  Tibor Kiss,et al.  Unsupervised Multilingual Sentence Boundary Detection , 2006, CL.

[28]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[29]  Kemal Oflazer,et al.  Initial Explorations in English to Turkish Statistical Machine Translation , 2006, WMT@HLT-NAACL.

[30]  John Cocke,et al.  A Statistical Approach to Language Translation , 1988, COLING.

[31]  Dan Klein,et al.  Phylogenetic Grammar Induction , 2010, ACL.

[32]  Steven Abney,et al.  Linguistic Issues in Language Technology LiLT , 2011 .

[33]  Harald Hammarström,et al.  Unsupervised Learning of Morphology and the Languages of the World , 2009 .

[34]  Carl de Marcken,et al.  Unsupervised language acquisition , 1996, ArXiv.

[35]  Robert C. Moore Fast and accurate sentence alignment of bilingual corpora , 2002, AMTA.

[36]  Kenneth Ward Church,et al.  A Program for Aligning Sentences in Bilingual Corpora , 1993, CL.

[37]  Stephan Vogel,et al.  Fixed Length Word Suffix for Factored Statistical Machine Translation , 2010, ACL.

[38]  Jason Baldridge,et al.  Computational strategies for reducing annotation effort in language documentation , 2010 .

[39]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[40]  David Yarowsky,et al.  Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora , 2001, HLT.

[41]  András Kornai,et al.  Parallel corpora for medium density languages , 2007 .

[42]  Kristina Toutanova,et al.  Generating Complex Morphology for Machine Translation , 2007, ACL.

[43]  Philipp Koehn,et al.  Enriching Morphologically Poor Languages for Statistical Machine Translation , 2008, ACL.

[44]  Steven Bird Last Words: Natural Language Processing and Linguistic Fieldwork , 2009, CL.

[45]  Nancy Ide,et al.  Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages , 1998, COLING-ACL.

[46]  Christopher D. Manning,et al.  The unsupervised learning of natural language structure , 2005 .

[47]  Slav Petrov,et al.  A Universal Part-of-Speech Tagset , 2011, LREC.

[48]  Mathias Creutz,et al.  Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0 , 2005 .

[49]  Lei Shi,et al.  Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model , 2008, EMNLP.

[50]  Helmut Schmid  Probabilistic part-of-speech tagging using decision trees , 1994 .

[51]  Siriwan Sereewattana Unsupervised Segmentation for Statistical Machine Translation , 2003 .

[52]  Dan Klein,et al.  Fast Exact Inference with a Factored Model for Natural Language Parsing , 2002, NIPS.

[53]  Kevin P. Scannell The Crúbadán Project: Corpus building for under-resourced languages , 2007 .

[54]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[55]  Fei Xia Multilingual Structural Projection across Interlinearized Text , 2007 .

[56]  Fei Xia,et al.  Language ID in the Context of Harvesting Language Data off the Web , 2009, EACL.

[57]  Christopher D. Manning,et al.  Generating Typed Dependency Parses from Phrase Structure Parses , 2006, LREC.

[58]  W. Johnson,et al.  Thesaurus Linguae Graecae canon of Greek authors and works , 1987 .

[59]  Steven Bird,et al.  The Human Language Project: Building a Universal Corpus of the World's Languages , 2010, ACL.

[60]  P. Lewis Ethnologue : languages of the world , 2009 .