Zero-shot Learning for Speech Recognition with Universal Phonetic Model

There are more than 7,000 languages in the world, but due to the lack of training sets, only a small number of them have speech recognition systems. Multilingual speech recognition provides a solution if at least some audio training data is available. Often, however, phoneme inventories differ between the training languages and the target language, making this approach infeasible. In this work, we address the problem of building an acoustic model for languages with zero audio resources. Our model is able to recognize unseen phonemes in the target language, if only a small text corpus is available. We adopt the idea of zero-shot learning, and decompose phonemes into corresponding phonetic attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over phonetic attributes, and then compute phoneme distributions with a customized acoustic model. We extensively evaluate our English-trained model on 20 unseen languages, and find that on average, it achieves 9.9% better phone error rate over a traditional CTC based acoustic model trained on English.

[1]  Themos Stafylakis,et al.  Zero-shot keyword spotting for visual speech recognition in-the-wild , 2018, ECCV.

[2]  Satoshi Nakamura,et al.  Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[3]  Sharon Goldwater,et al.  Multilingual bottleneck features for subword modeling in zero-resource languages , 2018, INTERSPEECH.

[4]  Ngoc Thang Vu,et al.  Multilingual deep neural network based acoustic modeling for rapid language adaptation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Florian Metze,et al.  Sequence-Based Multi-Lingual Low Resource Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Tanja Schultz,et al.  Language-independent and language-adaptive acoustic modeling for speech recognition , 2001, Speech Commun..

[7]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  James R. Glass,et al.  One-shot learning of generative speech concepts , 2014, CogSci.

[9]  Daniel Jurafsky,et al.  Lexicon-Free Conversational Speech Recognition with Neural Networks , 2015, NAACL.

[10]  Yajie Miao,et al.  EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[11]  Sanja Fidler,et al.  Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  G. D. Magoulas,et al.  Under review as a conference paper at ICLR 2017 , 2017 .

[13]  Mark Hasegawa-Johnson,et al.  Building an ASR System for Mboshi Using A Cross-Language Definition of Acoustic Units Approach , 2018, SLTU.

[14]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[15]  Gabriel Synnaeve,et al.  Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[16]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[17]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[18]  Mark J. F. Gales,et al.  Data augmentation for low resource languages , 2014, INTERSPEECH.

[19]  James R. Glass Towards unsupervised speech processing , 2012, 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA).

[20]  Hui Lin,et al.  A study on multilingual acoustic modeling for large vocabulary ASR , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21]  Andrew Y. Ng,et al.  Improving Word Representations via Global Context and Multiple Word Prototypes , 2012, ACL.

[22]  P. Ladefoged A course in phonetics , 1975 .

[23]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Bernt Schiele,et al.  Evaluating knowledge transfer and zero-shot learning in a large-scale setting , 2011, CVPR 2011.

[25]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[26]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  A. Waibel,et al.  IMPROVING PHONEME SET DISCOVERY FOR DOCUMENTING , 2017 .

[28]  Ralf Schlüter,et al.  Investigation on cross- and multilingual MLP features under matched and mismatched acoustical conditions , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Arthur M. Jacobs,et al.  Statistical analysis of the bidirectional inconsistency of spelling and sound in French , 1996 .

[30]  Tanja Schultz,et al.  Integrating multilingual articulatory features into speech recognition , 2003, INTERSPEECH.

[31]  John C. Wells,et al.  Computer-coding the IPA: a proposed extension of SAMPA , 1995 .

[32]  Geoffrey E. Hinton,et al.  Regularizing Neural Networks by Penalizing Confident Output Distributions , 2017, ICLR.

[33]  Ngoc Thang Vu,et al.  Multilingual multilayer perceptron for rapid language adaptation between and across language families , 2013, INTERSPEECH.

[34]  Hynek Hermansky,et al.  Cross-lingual and multi-stream posterior features for low resource LVCSR systems , 2010, INTERSPEECH.

[35]  Katrin Kirchhoff Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments , 1998, ICSLP.

[36]  Hervé Bourlard,et al.  An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation , 2017, INTERSPEECH.

[37]  Lukás Burget,et al.  Topic identification of spoken documents using unsupervised acoustic unit discovery , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38]  P. Lewis Ethnologue : languages of the world , 2009 .

[39]  Chris Dyer,et al.  PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors , 2016, COLING.

[40]  Kristen Grauman,et al.  Zero-shot recognition with unreliable attributes , 2014, NIPS.

[41]  Sanjeev Khudanpur,et al.  Topic Identification for Speech Without ASR , 2017, INTERSPEECH.

[42]  Siddharth Dalmia,et al.  Epitran: Precision G2P for Many Languages , 2018, LREC.

[43]  Florian Metze,et al.  Domain Robust Feature Extraction for Rapid Low Resource ASR Development , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[44]  Aren Jansen,et al.  The zero resource speech challenge 2017 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[45]  Rainer Stiefelhagen,et al.  Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Karen Livescu,et al.  An embedded segmental K-means model for unsupervised segmentation and clustering of speech , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[47]  Tanja Schultz,et al.  Multilingual articulatory features , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[48]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[49]  Brian Kan-Wing Mak,et al.  Multitask Learning of Deep Neural Networks for Low-Resource Speech Recognition , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[50]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[52]  Naoyuki Kanda,et al.  Elastic spectral distortion for low resource speech recognition with deep neural networks , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[53]  Ulrike Mosel,et al.  Essentials of language documentation , 2006 .

[54]  E. Vajda Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet , 2000 .

[55]  Laurent Besacier,et al.  Developments of Swahili resources for an automatic speech recognition system , 2012, SLTU.

[56]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[57]  Paul Deléglise,et al.  TED-LIUM: an Automatic Speech Recognition dedicated corpus , 2012, LREC.

[58]  Philip H. S. Torr,et al.  An embarrassingly simple approach to zero-shot learning , 2015, ICML.

[59]  Geoffrey Zweig,et al.  Achieving Human Parity in Conversational Speech Recognition , 2016, ArXiv.

[60]  David Miller,et al.  The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text , 2004, LREC.

[61]  Andrew Y. Ng,et al.  Zero-Shot Learning Through Cross-Modal Transfer , 2013, NIPS.

[62]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[63]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[64]  A. Waibel,et al.  Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features , 2016, IWSLT.

[65]  Solomon Teferra Abate,et al.  Using different acoustic, lexical and language modeling units for ASR of an under-resourced language - Amharic , 2014, Speech Commun..

[66]  Bogdan Ludusan,et al.  Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems , 2014, LREC.

[67]  Kenneth Ward Church,et al.  A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.