Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition

Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition tasks. Among these networks, the convolutional neural network (CNN) is effective at representing the local properties of speech formants. However, CNNs are not well suited to modeling long-term context dependencies between speech signal frames. Recurrent neural networks (RNNs) have recently shown strong ability to model such long-term dependencies, but their performance on low-resource speech recognition tasks is poor, sometimes even worse than that of conventional feed-forward neural networks, and they often overfit severely on the small training corpora of low-resource tasks. This paper presents our contributions toward combining CNNs with conventional RNNs through gate, highway, and residual connections to mitigate these problems. We explore the optimal neural network structures and training strategies for the proposed models. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited language packages (10 h) of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for building better-performing acoustic models for low-resource speech recognition tasks.
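As a hedged illustration of one of the connection types named above (not the authors' exact architecture), a highway connection mixes a transformed output H(x) with the unmodified input x through a learned sigmoid gate T(x): y = T(x) · H(x) + (1 − T(x)) · x. When the gate saturates toward zero the layer passes its input through unchanged, which eases the training of very deep stacks. A minimal numpy sketch, with hypothetical weights and dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Highway connection: y = T(x) * H(x) + (1 - T(x)) * x.

    H is the candidate transform (here a tanh affine layer) and T is
    the sigmoid transform gate; when T -> 0 the input is carried
    through unchanged.
    """
    h = np.tanh(x @ W_h + b_h)      # candidate transform H(x)
    t = sigmoid(x @ W_t + b_t)      # transform gate T(x) in (0, 1)
    return t * h + (1.0 - t) * x    # gated mix of H(x) and x

# Toy usage with hypothetical sizes (input and output dims must match
# so the carry path x can be added elementwise).
rng = np.random.default_rng(0)
dim = 8
x = rng.standard_normal((4, dim))                   # 4 frames of features
W_h, b_h = 0.1 * rng.standard_normal((dim, dim)), np.zeros(dim)
W_t, b_t = 0.1 * rng.standard_normal((dim, dim)), np.full(dim, -2.0)
y = highway_layer(x, W_h, b_h, W_t, b_t)
print(y.shape)  # (4, 8)
```

The negative gate bias `b_t` follows the common practice of initializing highway layers to favor the carry path, so early in training the layer behaves close to an identity mapping.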
