Speech Technology for Unwritten Languages

Speech technology plays an important role in everyday life: among other uses, speech serves as an interface for human-computer interaction, for instance in information retrieval and online shopping. For an unwritten language, however, speech technology is difficult to build, because it cannot be assembled from the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this article takes the first steps towards speech technology for unwritten languages. Specifically, the aims of this work were 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test whether the learned representations suffice to regenerate speech, to generate translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech to meaning and from meaning to speech, bypassing the need for text, is possible.
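
To make the speech-to-meaning idea concrete, the sketch below shows one common way such representations are learned in the visually grounded speech literature: a speech encoder and an image encoder are trained to map matched utterance-image pairs close together in a shared embedding space, with no text anywhere in the pipeline. This is a minimal illustration in PyTorch under assumed choices (a bidirectional GRU over mel-spectrogram frames, precomputed image features such as those from a pretrained ResNet, a batch-wise triplet loss); all names, dimensions, and hyperparameters are illustrative, not the authors' actual architecture.

```python
# Minimal sketch of textless speech-image embedding learning.
# Assumptions (not from the paper): PyTorch, 40-dim mel spectrograms,
# 2048-dim precomputed image features, a margin-based triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Maps a mel spectrogram (batch, frames, n_mels) to a fixed-size embedding."""
    def __init__(self, n_mels=40, hidden=512, embed_dim=1024):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, mels):
        out, _ = self.rnn(mels)                  # (batch, frames, 2*hidden)
        pooled = out.mean(dim=1)                 # average-pool over time
        return F.normalize(self.proj(pooled), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed image features into the shared embedding space."""
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def triplet_loss(speech_emb, image_emb, margin=0.2):
    """Pulls matched speech/image pairs together and pushes mismatched
    pairs (negatives drawn from the same batch) at least `margin` apart."""
    sims = speech_emb @ image_emb.t()            # (batch, batch) similarities
    pos = sims.diag().unsqueeze(1)               # matched pairs on the diagonal
    cost_im = F.relu(margin + sims - pos)        # wrong image for an utterance
    cost_sp = F.relu(margin + sims - pos.t())    # wrong utterance for an image
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return (cost_im.masked_fill(mask, 0).sum() +
            cost_sp.masked_fill(mask, 0).sum()) / sims.size(0)
```

At test time, retrieving an image that depicts an utterance reduces to encoding the utterance with the speech encoder and running a nearest-neighbour search over the image embeddings; the same learned speech embedding can serve as the "meaning" input to a downstream speech-synthesis or translation decoder.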
