Robust Acoustic Speech Feature Prediction From Noisy Mel-Frequency Cepstral Coefficients

This paper examines the effect of applying noise compensation to acoustic speech feature prediction from noisy mel-frequency cepstral coefficient (MFCC) vectors within a distributed speech recognition architecture. An acoustic speech feature vector (comprising fundamental frequency, formant frequencies, speech/nonspeech classification, and voicing classification) is predicted from an MFCC vector in a maximum a posteriori (MAP) framework using phoneme-specific or global models of speech. The effect of noise is considered, and three different noise compensation methods that have proven successful in robust speech recognition are integrated within the MAP framework. Experiments show that noise compensation can be applied successfully to prediction, with the best performance given by a model adaptation method that performs only slightly worse than matched training and testing. Further experiments consider the application of the predicted acoustic features to speech reconstruction. A series of human listening tests shows that the predicted features are sufficient for speech reconstruction and that noise compensation improves speech quality in noisy conditions.
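To make the prediction step concrete, the sketch below illustrates one common form of MAP-style feature prediction from a joint Gaussian mixture model over stacked MFCC/acoustic-feature vectors: the most probable mixture component given the MFCC vector is selected, and the conditional mean of the acoustic feature under that component is returned. This is a minimal illustration under assumed model structure, not the paper's exact estimator (the function name, the full-covariance joint GMM, and the single-component MAP choice are assumptions for clarity).

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_predict(x, weights, means, covs, dx):
    """Sketch of MAP prediction of an acoustic feature y from an MFCC vector x.

    Each mixture component k models the joint vector z = [x; y] with mean
    means[k] and full covariance covs[k]; dx is the MFCC dimensionality.
    MAP-style choice: pick the component most probable given x alone, then
    return the conditional mean E[y | x, k] under that component.
    (Illustrative only; the paper's estimator may differ in detail.)
    """
    # Posterior over components given x, using each Gaussian's x-marginal.
    post = np.array([
        w * multivariate_normal.pdf(x, mean=m[:dx], cov=C[:dx, :dx])
        for w, m, C in zip(weights, means, covs)
    ])
    k = int(np.argmax(post))          # MAP component
    m, C = means[k], covs[k]
    mu_x, mu_y = m[:dx], m[dx:]
    Cxx, Cyx = C[:dx, :dx], C[dx:, :dx]
    # Conditional Gaussian mean: mu_y + Cyx Cxx^{-1} (x - mu_x)
    return mu_y + Cyx @ np.linalg.solve(Cxx, x - mu_x)
```

Noise compensation would act on this model by adapting the means and covariances (or cleaning x) before prediction, which is the comparison the paper carries out.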
