Joint Bayesian Morphology Learning for Dravidian Languages

This paper presents a methodology for learning the complex agglutinative morphology of some Indian languages using Adaptor Grammars and morphology rules. Adaptor Grammars are a compositional Bayesian framework for grammatical inference; we define a morphological grammar for agglutinative languages within this framework and infer morphological boundaries from a plain-text corpus. Once morphological segmentations are produced, regular expressions for sandhi rules and orthography are applied to obtain the final segmentation. We test our algorithm on two morphologically complex languages from the Dravidian family. The same morphological model is applied to both languages, and the results are compared with those of other state-of-the-art unsupervised morphology learning systems.
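The post-processing step, in which regular expressions are applied to an already segmented word, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `+` boundary marker, the function name `apply_rules`, and the two toy rewrite rules are hypothetical and are not the paper's actual sandhi or orthography rules, and the romanized strings are synthetic examples rather than attested forms.

```python
import re

# Hypothetical rewrite rules, applied in order to a "+"-delimited segmentation.
# These are toy patterns for illustration only, not real Dravidian sandhi rules.
TOY_RULES = [
    # Toy glide removal: drop a "y" inserted between two vowels at a boundary.
    (re.compile(r"([aeiou])\+y([aeiou])"), r"\1+\2"),
    # Toy vowel restoration: put back a stem-final "u" that was elided
    # before a vowel-initial suffix.
    (re.compile(r"([^aeiou+])\+([aeiou])"), r"\1u+\2"),
]


def apply_rules(segmented, rules=TOY_RULES):
    """Apply each regex rule in turn, then split on the boundary marker."""
    for pattern, replacement in rules:
        segmented = pattern.sub(replacement, segmented)
    return segmented.split("+")


if __name__ == "__main__":
    print(apply_rules("kad+il"))
    print(apply_rules("mala+yil"))
```

In this sketch the rules run after the boundaries have been proposed, so the regular expressions only need to look at the characters on either side of a `+`. A real rule set would be language specific and would typically operate on the native script rather than a romanization.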
