Paper information - Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
