Enhanced power-normalized features for Mandarin robust speech recognition based on a voiced-unvoiced-silence decision

Power-normalized features have been shown to improve the performance of English large-vocabulary continuous speech recognition under different acoustic conditions. In this paper, taking the tonal characteristics of Mandarin speech into account, we apply a different strategy to each class of sound, based on a voiced-unvoiced-silence decision. For voiced sounds, harmonic enhancement based on a weighted harmonic-noise model (WHNM) is applied to capture the salient harmonic information accurately and to decrease the effect of various non-stationary noises; standard power-normalized processing (SPNP) is then performed. For unvoiced sounds, only SPNP is applied. For silence, a quality frame dropping (FD) algorithm is incorporated into the front-end. The resulting enhanced power-normalized features are used to process noise-corrupted Mandarin speech. Experimental results show better recognition accuracy than the ETSI Advanced Front-End (AFE) for Mandarin continuous speech recognition in noisy environments.
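
The per-class routing described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the energy and zero-crossing-rate thresholds in `classify_frame` are a toy stand-in for the voiced-unvoiced-silence decision, and `enhance_harmonics` and `power_normalize` are hypothetical callables standing in for the WHNM enhancement and SPNP stages, whose details are not given here. Dropping every silence frame is likewise a simplification of the FD algorithm.

```python
import numpy as np


def classify_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Toy voiced/unvoiced/silence decision (assumed, not the paper's rule).

    Low energy is treated as silence; otherwise a low zero-crossing rate
    is treated as voiced and a high one as unvoiced.
    """
    energy = float(np.mean(frame ** 2))
    if energy < energy_floor:
        return "silence"
    signs = np.signbit(frame)
    zcr = float(np.mean(signs[1:] != signs[:-1]))
    return "voiced" if zcr < zcr_split else "unvoiced"


def route_frames(frames, enhance_harmonics, power_normalize):
    """Apply the per-class strategy to each frame and collect features."""
    features = []
    for frame in frames:
        label = classify_frame(frame)
        if label == "silence":
            continue  # simplified frame dropping: skip silence entirely
        if label == "voiced":
            frame = enhance_harmonics(frame)  # placeholder for WHNM enhancement
        features.append(power_normalize(frame))  # placeholder for SPNP
    return features
```

Passing the two processing stages in as arguments keeps the routing logic separate from the signal processing, so either stage can be replaced without touching the dispatch.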
