Multilingual and Multimode Phone Recognition System for Indian Languages

The aim of this paper is to develop a flexible framework that automatically recognizes the phonetic units in a speech utterance of any language spoken in any mode. In this study, we consider two modes of speech, conversation and read, in four Indian languages: Telugu, Kannada, Odia, and Bengali. The proposed approach consists of two stages: (1) automatic speech mode classification (SMC) and (2) automatic phonetic recognition using a mode-specific multilingual phone recognition system (MPRS). Vocal tract and excitation source features are used for the SMC task, and the SMC systems are developed using multilayer perceptrons (MLPs). Vocal tract, excitation source, and tandem features are then used to build the deep neural network (DNN)-based MPRSs. The performance of the proposed approach is compared with that of mode-dependent MPRSs. Experimental results show that the proposed approach, which combines SMC and MPRS into a single system, outperforms the baseline mode-dependent MPRSs.
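
To make the two-stage design concrete, the minimal Python sketch below shows how an SMC front end could route each utterance to the MPRS trained for the predicted mode. This is an illustration of the pipeline only, not the authors' implementation: the feature extractor, the MLP weights, and the phone decoder are hypothetical placeholders standing in for the trained SMC and the DNN-based mode-specific MPRSs.

```python
# Sketch of the proposed two-stage pipeline (assumptions noted inline):
# Stage 1: an MLP speech-mode classifier (SMC) predicts the speech mode.
# Stage 2: the utterance is decoded by the MPRS trained for that mode.
import numpy as np

MODES = ["conversation", "read"]
rng = np.random.default_rng(0)

def smc_features(utterance: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for the vocal tract and excitation source
    # features used for SMC (e.g., spectral features plus source parameters).
    return rng.standard_normal(52)

class MLP:
    """Toy one-hidden-layer perceptron standing in for the trained SMC."""

    def __init__(self, d_in: int, d_hid: int, d_out: int):
        # Random weights for illustration; the real SMC is trained on data.
        self.w1 = rng.standard_normal((d_in, d_hid)) * 0.1
        self.w2 = rng.standard_normal((d_hid, d_out)) * 0.1

    def predict(self, x: np.ndarray) -> int:
        h = np.tanh(x @ self.w1)          # hidden-layer activations
        return int(np.argmax(h @ self.w2))  # index of the predicted mode

def recognize_phones(utterance: np.ndarray, mode: str) -> list[str]:
    # Placeholder for the mode-specific DNN-based MPRS decoder;
    # returns a dummy phone sequence here.
    return ["a", "k", "a"]

smc = MLP(d_in=52, d_hid=64, d_out=len(MODES))

def multimode_phone_recognition(utterance: np.ndarray):
    """Stage 1: classify speech mode; Stage 2: decode with that mode's MPRS."""
    mode = MODES[smc.predict(smc_features(utterance))]
    return mode, recognize_phones(utterance, mode)

if __name__ == "__main__":
    mode, phones = multimode_phone_recognition(rng.standard_normal(16000))
    print(mode, phones)
```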
