Improvement of phone recognition accuracy using speech mode classification

In this work, we develop a speech mode classification (SMC) model to improve the performance of a phone recognition system (PRS). We explore vocal tract system, excitation source, and prosodic features for building the SMC model; these features are extracted from the voiced regions of the speech signal. Three modes of speech are considered: read, extempore, and conversation. The vocal tract component of speech is represented by Mel-frequency cepstral coefficients (MFCCs). The excitation source characteristics are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. The prosodic information is derived from pitch and intensity. SMC models are developed using the above features both independently and in fusion. Experiments are carried out on a Bengali speech corpus to evaluate the accuracy of the SMC model using artificial neural networks (ANNs), naive Bayes, support vector machines (SVMs), and k-nearest neighbors (KNN). The four resulting classifiers are combined using a maximum-voting approach for optimal performance. The results show that the SMC model built on the fusion of vocal tract system, excitation source, and prosodic features yields the best performance, with an accuracy of 98%. Finally, the proposed speech mode classifier is integrated into the PRS, and the phone recognition accuracy improves by 11.08%.
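The pipeline sketched below is a minimal illustration of the approach the abstract describes, not the authors' implementation: it assumes librosa for MFCC, pitch, and intensity extraction and scikit-learn for the four classifiers and their voting combination. The MPDSS and RMFCC excitation-source features and the exact voiced-region segmentation from the paper are simplified or omitted, and the feature pooling shown here is an illustrative choice.

```python
# Sketch of the feature extraction + classifier-fusion pipeline described in
# the abstract. librosa/scikit-learn usage, pooling statistics, and parameter
# values are assumptions for illustration, not the paper's settings.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

MODES = ["read", "extempore", "conversation"]  # the three speech modes

def utterance_features(path, sr=16000):
    """Vocal-tract (MFCC) and prosodic (pitch, intensity) statistics,
    pooled over the frames of one utterance (MPDSS/RMFCC omitted)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # vocal tract
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # pitch contour
    rms = librosa.feature.rms(y=y)[0]                          # intensity proxy
    f0 = f0[voiced] if voiced.any() else np.zeros(1)           # voiced frames only
    # Pool frame-level features into one fixed-length utterance vector.
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()],
    ])

def build_fused_classifier():
    """ANN, naive Bayes, SVM, and KNN combined by hard (maximum) voting."""
    base = [
        ("ann", make_pipeline(StandardScaler(),
                              MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))),
        ("nb",  make_pipeline(StandardScaler(), GaussianNB())),
        ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ]
    return VotingClassifier(estimators=base, voting="hard")
```

With hard voting, the ensemble outputs the mode predicted by the majority of the four base classifiers, matching the maximum-voting fusion described above; after `clf = build_fused_classifier()`, the model is trained and applied with the usual `clf.fit(X, y)` and `clf.predict(X_new)` calls.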
