A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages lack such resources or a stable orthography. Systems built under these near-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks include automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It consists of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We describe how the data was collected, cleaned, and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
