Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

For endangered languages, data collection campaigns have to accommodate the challenge that many of these languages come from oral traditions, and producing transcriptions is costly. It is therefore essential to translate the recordings into a widely spoken language to ensure that they remain interpretable. In this paper we investigate how the choice of translation language affects the subsequent documentation work and the automatic approaches that may be applied to the resulting bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs, which we apply to the task of low-resource unsupervised word segmentation and alignment. Our results show that the choice of translation language influences word segmentation performance, and that different lexicons are learned when different aligned translations are used. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models’ input representation increases their translation and alignment quality, especially for challenging language pairs.
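As a rough illustration of two ideas in the abstract, the sketch below enumerates directed language pairs and injects boundary clues into an unsegmented input sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the language list assumes the eight languages of the MaSS corpus, and the function names, the `<b>` marker, and the boundary indices are hypothetical choices made for this example.

```python
from itertools import permutations

# Assumption: the eight languages covered by the MaSS corpus.
MASS_LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def directed_pairs(languages):
    """Return every ordered (source, translation) pair of distinct languages."""
    return list(permutations(languages, 2))


def add_boundary_clues(units, boundaries, marker="<b>"):
    """Insert `marker` after each unit whose index is in `boundaries`.

    `units` is an unsegmented symbol sequence. `boundaries` holds the 0-based
    indices after which a segmenter (hypothetical here) proposed a word
    boundary. A boundary after the final unit is ignored.
    """
    marked = []
    last = len(units) - 1
    for i, unit in enumerate(units):
        marked.append(unit)
        if i in boundaries and i < last:
            marked.append(marker)
    return marked


pairs = directed_pairs(MASS_LANGUAGES)
print(len(pairs))  # 8 languages * 7 partners = 56 directed pairs
print(add_boundary_clues(list("thedog"), {2}))
```

Counting ordered pairs treats (French, English) and (English, French) as distinct settings, which matches the figure of 56 pairs given above. The marker-insertion step is only one possible way to expose boundary clues in an input representation.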

[1] Lucille J. Watahomigie, et al. Endangered languages, 1991, Science.

[2] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[3] Nicholas Evans, et al. Searching for meaning in the Library of Babel: field semantics and problems of digital archiving, 2003.

[4] Thomas L. Griffiths, et al. Contextual Dependencies in Unsupervised Word Segmentation, 2006, ACL.

[5] Bowen Zhou, et al. Towards Speech Translation of Non Written Languages, 2006, 2006 IEEE Spoken Language Technology Workshop.

[6] L. Babel, et al. Searching for meaning in the Library of Babel: field semantics and problems of digital archiving, 2006.

[7] Mark Johnson, et al. Nonparametric bayesian models of lexical acquisition, 2007.

[8] Caren Brinckmann, et al. Transcription bottleneck of speech corpus exploitation, 2008.

[9] Mark Johnson, et al. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars, 2009, NAACL.

[10] T. Griffiths, et al. A Bayesian framework for word segmentation: Exploring the effects of context, 2009, Cognition.

[11] Charles Yang, et al. Recession Segmentation: Simpler Online Word Segmentation Using Limited Resources, 2010, CoNLL.

[12] Martin Haspelmath, et al. The indeterminacy of word segmentation and the nature of morphology and syntax, 2011.

[13] Peter Austin, et al. The Cambridge handbook of endangered languages, 2011.

[14] Florian Schiel, et al. Untrained Forced Alignment of Transcriptions and Audio for Language Documentation Corpora using WebMAUS, 2014, LREC.

[15] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[16] Sebastian Stüker, et al. Breaking the Unwritten Language Barrier: The BULB Project, 2016, SLTU.

[17] Wen Wang, et al. Toward human-assisted lexical unit discovery without text resources, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[18] Alexandre Allauzen, et al. Preliminary Experiments on Unsupervised Word Discovery in Mboshi, 2016, INTERSPEECH.

[19] David Chiang, et al. An Attentional Model for Speech Translation Without Transcription, 2016, NAACL.

[20] Olivier Pietquin, et al. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation, 2016, NIPS 2016.

[21] Aline Villavicencio, et al. Unwritten languages demand attention too! Word discovery with encoder-decoder models, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[22] Florian Schiel, et al. Multilingual processing of speech via web services, 2017, Comput. Speech Lang.

[23] Antonios Anastasopoulos, et al. A small Griko-Italian speech translation corpus, 2018, SLTU.

[24] Aline Villavicencio, et al. Unsupervised Word Segmentation from Speech with Attention, 2018, INTERSPEECH.

[25] David Chiang, et al. Leveraging translations for speech transcription in low-resource settings, 2018, INTERSPEECH.

[26] Scott Heath, et al. Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS), 2018, SLTU.

[27] Sebastian Stüker, et al. A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments, 2017, LREC.

[28] Graham Neubig, et al. Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit, 2018.

[29] Laurent Besacier, et al. Controlling Utterance Length in NMT-based Word Segmentation with Attention, 2019, IWSLT.

[30] Aline Villavicencio, et al. Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings, 2019, INTERSPEECH.

[31] Satoshi Nakamura, et al. Speech-to-Speech Translation Between Untranscribed Unknown Languages, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[32] Laurent Besacier, et al. MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible, 2019, LREC.