Preliminary Experiments on Unsupervised Word Discovery in Mboshi

The need to document thousands of endangered languages encourages collaboration between linguists and computer scientists, so that the documentary linguistics community can be supported by automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims to develop such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the Republic of Congo, we investigate techniques for unsupervised word discovery from an unsegmented stream of phonemes. We compare different models and algorithms, both monolingual and bilingual, on a new corpus in Mboshi and French, and discuss various ways to represent the data with suitable granularity. An additional French-English corpus allows us to contrast the results obtained on Mboshi and to experiment with more data.
