Clean speech reconstruction from MFCC vectors and fundamental frequency using an integrated front-end

The aim of this work is to reconstruct a noise-free time-domain speech signal from a stream of MFCC vectors together with fundamental frequency and voicing estimates, such as may be received in a distributed speech recognition system. To facilitate reconstruction, a sinusoidal model and a source-filter model of speech are compared through listening tests and spectrogram analysis, with the result that the sinusoidal model provides higher-quality speech reconstruction. Analysis of the sinusoidal model shows that clean speech reconstruction requires both a noise-free spectral envelope and robust estimates of the fundamental frequency and voicing. Investigation into fundamental frequency estimation reveals that an auditory model-based approach outperforms the other estimation methods considered. This leads to the proposal of an integrated front-end that uses the auditory model both for fundamental frequency and voicing estimation and as the filterbank stage in MFCC extraction, thereby reducing computation. Applying spectral subtraction to the auditory model parameters improves the spectral envelope estimates needed for clean speech reconstruction. Experiments on the Aurora connected digits database show that the auditory model-based MFCCs give recognition performance comparable to that of conventional MFCCs. Speech reconstruction tests reveal that combining robust fundamental frequency and voicing estimation with spectral subtraction in the integrated front-end yields intelligible and relatively noise-free speech.
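As a rough illustration of why the sinusoidal model needs both a spectral envelope and a fundamental frequency, the sketch below synthesises one voiced frame by summing harmonics of the fundamental, each weighted by the envelope sampled at that harmonic's frequency. It is a minimal sketch under stated assumptions, not the method of this work: the function name `synthesize_voiced_frame`, the `envelope` callable (a stand-in for an envelope recovered from MFCCs), the 8 kHz sample rate, the 80-sample frame, and the use of zero phase for every harmonic are all hypothetical choices made for the example.

```python
import math


def synthesize_voiced_frame(f0_hz, envelope, sample_rate=8000, num_samples=80):
    """Sum harmonics of f0, weighted by a spectral envelope (illustrative only).

    f0_hz       -- fundamental frequency estimate for the frame, in Hz
    envelope    -- hypothetical callable: frequency in Hz -> linear amplitude
    sample_rate -- assumed sampling rate in Hz
    num_samples -- assumed frame length in samples
    """
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive for a voiced frame")

    nyquist = sample_rate / 2.0
    # Keep only harmonics strictly below the Nyquist frequency.
    num_harmonics = int(math.ceil(nyquist / f0_hz)) - 1

    frame = []
    for n in range(num_samples):
        t = n / sample_rate
        sample = 0.0
        for k in range(1, num_harmonics + 1):
            freq = k * f0_hz
            # Zero phase is assumed for every harmonic; a practical system
            # would also need a phase model and smooth frame-to-frame joins.
            sample += envelope(freq) * math.cos(2.0 * math.pi * freq * t)
        frame.append(sample)
    return frame


if __name__ == "__main__":
    # Flat envelope, 1 kHz fundamental: harmonics at 1, 2 and 3 kHz.
    frame = synthesize_voiced_frame(1000.0, lambda f: 1.0)
    print(len(frame), round(frame[0], 6))
```

An error in the fundamental frequency moves every harmonic, and noise in the envelope changes every harmonic amplitude, which is consistent with the abstract's observation that both quantities must be estimated robustly. Unvoiced frames would need a different excitation and are not covered by this sketch.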
