Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of linguists' expertise on these particular languages. The second is to explore the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to large improvements on the word discovery task, ultimately achieving a gain of about 30% in token F-score over a strong baseline.
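For readers unfamiliar with the metric, the following is a minimal sketch of how token F-score is commonly computed for word segmentation: a predicted word counts as correct only when both of its boundaries match a gold word. The function names and the toy English example are illustrative assumptions and are not taken from the paper or its evaluation scripts.

```python
def token_spans(words):
    """Return the set of (start, end) character spans covered by each word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_f_score(gold, predicted):
    """Compute token precision, recall and F-score over paired utterances.

    `gold` and `predicted` are lists of strings in which words are separated
    by spaces; both must segment the same underlying character sequence.
    """
    correct = n_gold = n_pred = 0
    for gold_utt, pred_utt in zip(gold, predicted):
        gold_words, pred_words = gold_utt.split(), pred_utt.split()
        if "".join(gold_words) != "".join(pred_words):
            raise ValueError("gold and predicted utterances differ in content")
        gold_spans = token_spans(gold_words)
        pred_spans = token_spans(pred_words)
        # A token is correct when its exact span appears in the gold segmentation.
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example (not data from the paper): one under-segmented utterance.
p, r, f = token_f_score(["the dog barks"], ["thedog barks"])
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In the toy example only "barks" has both boundaries right, so one of two predicted tokens and one of three gold tokens are correct.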
