Indian Languages ASR: A Multilingual Phone Recognition Framework with IPA-Based Common Phone-set, Predicted Articulatory Features and Feature Fusion

This study describes a multilingual phone recognition system for four Indian languages: Kannada, Telugu, Bengali, and Odia. Transcriptions are derived using the International Phonetic Alphabet (IPA). The Multilingual Phone Recognition System (MPRS) is developed using state-of-the-art DNNs, and its performance is improved using Articulatory Features (AFs). DNNs are used to predict the AFs for the place, manner, roundness, frontness, and height AF groups. The MPRS is also developed using oracle AFs, and its performance is compared with that of the system using predicted AFs. Oracle AFs set an upper bound on the performance achievable with AFs predicted from MFCC features by DNNs. In addition to the AFs, we explore the use of phone posteriors to further boost the performance of the MPRS. We show that oracle AFs, combined with MFCCs through feature fusion, offer a remarkably low target phone error rate (PER) of 10.4%, a 24.7% absolute reduction compared to the baseline MPRS with MFCCs alone. The best-performing system using predicted AFs shows a 2.8% absolute reduction in PER (an 8% relative reduction) compared to the baseline MPRS.
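
The figures quoted above can be cross-checked with a little arithmetic, and the idea of feature fusion can be illustrated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes that fusion means frame-wise concatenation of the MFCC vector with the AF vector, and the names `fuse_frames` and `relative_reduction` are hypothetical helpers introduced only for illustration. The baseline PER is not stated in the abstract; it is inferred here as the oracle PER plus the reported absolute reduction.

```python
from typing import List, Sequence


def fuse_frames(mfcc: Sequence[Sequence[float]],
                afs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate two feature streams frame by frame.

    Assumption: feature fusion is plain concatenation of the per-frame
    MFCC vector with the per-frame AF vector.
    """
    if len(mfcc) != len(afs):
        raise ValueError("feature streams must have the same number of frames")
    return [list(m) + list(a) for m, a in zip(mfcc, afs)]


def relative_reduction(baseline_per: float, absolute_reduction: float) -> float:
    """Relative PER reduction, in percent, for a given absolute reduction."""
    return 100.0 * absolute_reduction / baseline_per


# Figures taken from the abstract (all in percent PER).
ORACLE_PER = 10.4
ORACLE_ABS_REDUCTION = 24.7
PREDICTED_ABS_REDUCTION = 2.8

# Inferred, not stated in the abstract: baseline = oracle PER + absolute reduction.
BASELINE_PER = ORACLE_PER + ORACLE_ABS_REDUCTION

PREDICTED_REL_REDUCTION = relative_reduction(BASELINE_PER, PREDICTED_ABS_REDUCTION)

if __name__ == "__main__":
    # Toy frames: 2 MFCC values and 3 AF values per frame (made-up numbers).
    fused = fuse_frames([[1.0, 2.0]], [[0.0, 1.0, 0.0]])
    print(fused)
    print(f"baseline PER ~ {BASELINE_PER:.1f}%")
    print(f"relative reduction ~ {PREDICTED_REL_REDUCTION:.1f}%")
```

Under these assumptions the implied baseline PER is about 35.1%, and a 2.8% absolute reduction corresponds to roughly an 8% relative reduction, which agrees with the figures reported above.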
