An Exploration towards Joint Acoustic Modeling for Indian Languages: IIIT-H Submission for Low Resource Speech Recognition Challenge for Indian Languages, INTERSPEECH 2018

India being a multilingual society, a multilingual automatic speech recognition (ASR) system is widely appreciated. Despite their different orthographies, Indian languages share the same phonetic space. To exploit this property, a joint acoustic model has been trained for developing a multilingual ASR system using a common phone-set. Three Indian languages, namely Telugu, Tamil, and Gujarati, are considered for the study. This work studies the amenability of two different acoustic modeling approaches for training a joint acoustic model using a common phone-set: subspace Gaussian mixture models (SGMM), and recurrent neural networks (RNN) trained with the connectionist temporal classification (CTC) objective function. From the experimental results, it can be observed that the joint acoustic models trained with RNN-CTC have performed better than the SGMM system even on 120 hours of data (approximately 40 hours per language). The joint acoustic model trained with RNN-CTC has also performed better than the monolingual models, owing to efficient data sharing across the languages. Conditioning the joint model on language identity offered only a minimal advantage. Sub-sampling the features by a factor of 2 while training the RNN-CTC models has reduced training times and has performed better.
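The feature sub-sampling mentioned above can be illustrated with a minimal sketch: keeping every second frame of an acoustic feature matrix halves the sequence length the RNN-CTC model must process, which is what reduces training time. The function name and the 39-dimensional MFCC example below are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def subsample_frames(features: np.ndarray, factor: int = 2) -> np.ndarray:
    """Keep every `factor`-th frame of a (time, dim) feature matrix.

    With factor=2 the frame rate is halved, shortening the input
    sequences seen by the RNN-CTC acoustic model.
    """
    return features[::factor]

# Example: 100 frames of 39-dim MFCC features reduced to 50 frames.
feats = np.random.randn(100, 39)
print(subsample_frames(feats).shape)  # (50, 39)
```

CTC tolerates this because its output alignment is learned rather than fixed per frame, provided the sub-sampled sequence remains longer than the label sequence.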
