An Exploration towards Joint Acoustic Modeling for Indian Languages: IIIT-H Submission for Low Resource Speech Recognition Challenge for Indian Languages, INTERSPEECH 2018

India being a multilingual society, a multilingual automatic speech recognition (ASR) system is widely appreciated. Despite their different orthographies, Indian languages share the same phonetic space. To exploit this property, a joint acoustic model has been trained for developing a multilingual ASR system using a common phone-set. Three Indian languages, namely Telugu, Tamil, and Gujarati, are considered for the study. This work studies the amenability of two different acoustic modeling approaches for training a joint acoustic model using a common phone-set: subspace Gaussian mixture models (SGMM), and recurrent neural networks (RNN) trained with the connectionist temporal classification (CTC) objective function. From the experimental results, it can be observed that the joint acoustic models trained with RNN-CTC have performed better than the SGMM system even on 120 hours of data (approximately 40 hours per language). The joint acoustic model trained with RNN-CTC has also performed better than the monolingual models, owing to efficient data sharing across the languages. Conditioning the joint model on language identity offered only a minimal advantage. Sub-sampling the features by a factor of 2 while training the RNN-CTC models has reduced training times and has performed better.
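The feature sub-sampling mentioned above can be illustrated with a minimal sketch: keeping every second frame of an acoustic feature matrix halves the sequence length the RNN-CTC model must process, which is what reduces training time. The function name and the 39-dimensional MFCC example below are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def subsample_frames(features: np.ndarray, factor: int = 2) -> np.ndarray:
    """Keep every `factor`-th frame of a (time, dim) feature matrix.

    With factor=2 the frame rate is halved, shortening the input
    sequences seen by the RNN-CTC acoustic model.
    """
    return features[::factor]

# Example: 100 frames of 39-dim MFCC features reduced to 50 frames.
feats = np.random.randn(100, 39)
print(subsample_frames(feats).shape)  # (50, 39)
```

CTC tolerates this because its output alignment is learned rather than fixed per frame, provided the sub-sampled sequence remains longer than the label sequence.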
