Paper information - A small Griko-Italian speech translation corpus

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned, and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for two tasks: speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, with the aim of encouraging replicability and diversity in computational language documentation experiments.
