A small Griko-Italian speech translation corpus

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for the tasks of speech-to-translation alignment and unsupervised word discovery. The dataset will be made available online to encourage replicability and diversity in computational language documentation experiments.
