Prosodic Features and Formant Contribution for Speech Recognition System over Mobile Network

This paper investigates the contribution of formants and of prosodic features such as pitch and energy to the performance of automatic speech recognition systems over mobile networks, in particular with the GSM EFR (Global System for Mobile Enhanced Full Rate) codec. The front-end of the speech recognition system combines features extracted by converting the quantized spectral information of the speech coder with prosodic information and formant frequencies. The quantized spectral information is represented by the LPC (Linear Predictive Coding) coefficients, the LSFs (Line Spectral Frequencies), and two approximations of the LPC Cepstral Coefficients (LPCCs) computed from the LSFs: the Pseudo Cepstral Coefficients (PCC) and the Pseudo-Cepstrum (PCEP) coefficients. The resulting speaker-independent speech recognition system is based on a Continuous Hidden Markov Model (CHMM) classifier. The results show that the combined multivariate feature vectors significantly improve speech recognition performance in the mobile environment compared with a system that uses the speech coder bit-stream features alone.
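As a rough illustration of the kind of front-end described above, the sketch below converts LPC coefficients to LPC cepstral coefficients with the standard recursion and appends pitch, energy, and formant values to form one frame-level feature vector. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the sign convention A(z) = 1 + sum_k a_k z^-k, the ordering of the concatenated features, and all numeric values are hypothetical, and the PCC and PCEP approximations from the LSFs are not shown.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Cepstral coefficients c_1..c_n_ceps of the all-pole model 1/A(z).

    `a` holds a_1..a_p, assuming A(z) = 1 + sum_k a_k z^-k (an assumed
    sign convention; some texts use the opposite sign).
    """
    p = len(a)
    c: List[float] = []
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:  # a_{n-k} is zero beyond the model order
                acc += k * c[k - 1] * a[n - k - 1]
        # c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}
        c.append(-a_n - acc / n)
    return c


def build_feature_vector(
    lpc: Sequence[float],
    pitch_hz: float,
    log_energy: float,
    formants_hz: Sequence[float],
    n_ceps: int = 12,
) -> List[float]:
    """Concatenate cepstral, prosodic, and formant features for one frame.

    The ordering here is a hypothetical choice made for illustration.
    """
    return lpc_to_cepstrum(lpc, n_ceps) + [pitch_hz, log_energy] + list(formants_hz)


# Hypothetical frame: a first-order model with a single pole at z = 0.5.
frame = build_feature_vector([-0.5], 120.0, -3.2, [500.0, 1500.0, 2500.0], n_ceps=3)
```

For the single-pole case the cepstrum has the closed form c_n = 0.5^n / n, which gives a quick sanity check on the recursion before it is applied to coefficients decoded from a codec bit-stream.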
