Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages

This paper proposes a multilingual acoustic modeling approach for Indian languages based on a Multitask Learning (MTL) framework. Language-specific phoneme recognition is explored as an auxiliary task in the MTL framework, alongside the primary task of multilingual senone classification. This auxiliary task regularizes the primary task through both the context-independent phoneme targets and the language identities induced by the language-specific phoneme sets. The MTL network is further extended by structuring the primary and auxiliary task outputs as a Structured Output Layer (SOL), so that the two predictions depend on each other. Experiments are performed on a database of three Indian languages: Gujarati, Tamil, and Telugu. The results show that the proposed MTL-SOL framework outperforms the baseline monolingual systems, with relative word error rate reductions of 3.1-4.4% on the development sets and 2.9-4.1% on the evaluation sets.
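The coupling described above can be sketched in a few lines: a shared network feeds an auxiliary phoneme head, and the SOL routes the auxiliary posteriors into the primary senone head so the senone prediction is conditioned on the phoneme prediction. This is a minimal illustrative sketch, not the paper's actual configuration; all layer sizes, the single shared layer, and the concatenation-based coupling are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical; the paper does not specify them here)
FEAT_DIM, HIDDEN_DIM = 40, 64
N_PHONEMES = 50   # auxiliary task: language-specific phoneme targets
N_SENONES = 200   # primary task: multilingual senone targets

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared hidden layer feeding both task heads
W_shared = rng.standard_normal((FEAT_DIM, HIDDEN_DIM)) * 0.1
W_aux = rng.standard_normal((HIDDEN_DIM, N_PHONEMES)) * 0.1
# SOL coupling: the primary head also consumes the auxiliary posteriors,
# so the senone output depends on the phoneme output
W_prim = rng.standard_normal((HIDDEN_DIM + N_PHONEMES, N_SENONES)) * 0.1

def forward(x):
    h = np.tanh(x @ W_shared)                  # shared representation
    p_phone = softmax(h @ W_aux)               # auxiliary: phoneme posteriors
    sol_in = np.concatenate([h, p_phone], axis=-1)
    p_senone = softmax(sol_in @ W_prim)        # primary: senone posteriors
    return p_phone, p_senone

x = rng.standard_normal((8, FEAT_DIM))         # a batch of 8 acoustic frames
p_phone, p_senone = forward(x)
```

In training, both heads would receive cross-entropy losses (senone and phoneme targets, respectively), with the auxiliary loss acting as the regularizer described in the abstract.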
