Data-pooling and multi-task learning for enhanced performance of speech recognition systems in multiple low resourced languages

We present two approaches to improve the performance of automatic speech recognition (ASR) systems for three low-resource languages: Gujarati, Tamil and Telugu. The first approach is data-pooling with phone mapping (DP-PM). A deep neural network (DNN) is first trained to predict the senones of the target language. The feature vectors of the other (source) languages, together with their alignments, are then used to map each source-language phone to a target-language phone. The source-language lexicons are rewritten using this phone mapping, and an ASR system for the target language is trained on both the target data and the modified source data. DP-PM gives relative improvements in word error rate (WER) of 5.1% for Gujarati, 3.1% for Tamil and 3.4% for Telugu over the corresponding baselines. The second approach is multi-task DNN (MT-DNN) modeling. Feature vectors from all three languages are used to train a single DNN with three output layers, each predicting the senones of one language. The objective functions of the output layers are modified so that, during training, a feature vector from a given language updates only the DNN layers responsible for predicting that language's senones. MT-DNN achieves relative improvements in WER of 5.7%, 3.3% and 5.2% for Gujarati, Tamil and Telugu, respectively.
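The phone-mapping and lexicon-rewriting steps of DP-PM can be illustrated with a small sketch. The abstract does not say how the mapping is derived from the DNN outputs, so the sketch assumes a simple majority vote: each source phone is mapped to the target phone that the target-language DNN predicts most often for the frames aligned to that source phone. The function names, phone labels and lexicon entry below are hypothetical and serve only as illustration; the DNN predictions are supplied as a precomputed list rather than produced by a real model.

```python
from collections import Counter, defaultdict


def learn_phone_map(source_phones, predicted_target_phones):
    """Map each source phone to its most frequently predicted target phone.

    source_phones: per-frame source phone labels, taken from the
        source-language alignments.
    predicted_target_phones: per-frame target phone labels, assumed to come
        from the target-language DNN (senone predictions collapsed to phones).
    Majority voting is an assumption of this sketch, not a detail from the
    abstract.
    """
    votes = defaultdict(Counter)
    for src, tgt in zip(source_phones, predicted_target_phones):
        votes[src][tgt] += 1
    return {src: counts.most_common(1)[0][0] for src, counts in votes.items()}


def remap_lexicon(lexicon, phone_map):
    """Rewrite every pronunciation in a source lexicon with target phones."""
    return {
        word: [phone_map[phone] for phone in phones]
        for word, phones in lexicon.items()
    }


# Hypothetical frame-level data: eight frames of source-language speech.
source_frames = ["a", "a", "a", "t", "t", "zh", "zh", "zh"]
target_predictions = ["a", "a", "aa", "t", "t", "l", "l", "r"]

phone_map = learn_phone_map(source_frames, target_predictions)

# Hypothetical source-language lexicon with a single entry.
source_lexicon = {"word1": ["t", "a", "zh"]}
mapped_lexicon = remap_lexicon(source_lexicon, phone_map)

print(phone_map)
print(mapped_lexicon)
```

In this toy example the source phone `zh`, which has no counterpart in the hypothetical target phone set, is mapped to `l` because `l` wins the vote over `r`. The rewritten lexicon can then be pooled with the target-language lexicon so that source-language audio contributes to training the target acoustic model.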
