Knowledge-Rich Morphological Priors for Bayesian Language Models

We present a morphology-aware nonparametric Bayesian model of language whose prior distribution uses manually constructed finite-state transducers to capture the word formation processes of particular languages. This relaxes the word independence assumption and enables sharing of statistical strength across, for example, stems or inflectional paradigms in different contexts. Our model can be used in virtually any scenario where multinomial distributions over words would be used. We obtain state-of-the-art results in language modeling, word alignment, and unsupervised morphological disambiguation for a variety of morphologically rich languages.
