Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

The polarization of the world's languages is becoming increasingly apparent. Many languages, especially endangered ones, are low-resource owing to a lack of available information, which poses serious challenges for both language conservation and cultural heritage. Speech recognition in low-resource scenarios has therefore become a prominent topic in speech research. With its complex network structures and large numbers of model parameters, deep learning has become a powerful tool for speech recognition and is of broad significance for the study of low-resource speech recognition. Focusing on the low-resource setting, this article reviews the history and current state of research on two kinds of acoustic models: deep neural network models and end-to-end structures. We then elaborate on several key techniques for improving performance from two perspectives, data and model training. Two projects for low-resource languages are also introduced. Finally, possible future developments are pointed out. This overview is intended as a reference for work in computer speech and language processing.
