Enhancing distributed speech recognition with back-end speech reconstruction

In this paper, we present a method to enhance the usefulness of a Distributed Speech Recognition (DSR) system by adding the capability to reconstruct speech at the back end. Speech reconstruction is achieved using the standard DSR parameters, viz., Mel-Frequency Cepstral Coefficients (MFCC) and log-energy, and some additional parameters, viz., voicing class, pitch period, and (optionally) higher-resolution energy information. From the MFCC parameters and energy information, the spectral magnitudes at the harmonics of the pitch frequency are estimated. Based on the voicing class information, the harmonic phases are appropriately modeled. The harmonic magnitudes and phases are then used to reconstruct speech according to the well-known sinusoidal model for speech synthesis [4][5]. Transmission of the additional parameters for speech reconstruction increases the DSR bit rate by less than 20%. Evaluation by the Mean Opinion Score (MOS) test and the Diagnostic Rhyme Test (DRT) shows that speech reconstructed in this way is of reasonable quality and quite intelligible.
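The final synthesis step can be illustrated with a minimal sketch of the sinusoidal model, in which one frame is a sum of cosines at integer multiples of the pitch frequency. The function name `synthesize_frame`, the 8 kHz sampling rate, and the frame length are assumptions chosen for illustration and are not taken from the paper. The sketch also takes the harmonic magnitudes and phases as given, so it does not cover their estimation from MFCC and energy, the class-dependent phase modeling, or any smoothing between frames.

```python
import math

def synthesize_frame(magnitudes, phases, pitch_hz, sample_rate=8000, num_samples=80):
    """Synthesize one frame as a sum of harmonically related cosines.

    magnitudes[k-1] and phases[k-1] belong to the k-th harmonic of pitch_hz.
    Harmonics at or above the Nyquist frequency are skipped to avoid aliasing.
    All parameter defaults are illustrative assumptions.
    """
    w0 = 2.0 * math.pi * pitch_hz / sample_rate  # fundamental, radians per sample
    frame = []
    for n in range(num_samples):
        sample = 0.0
        for k, (amp, phi) in enumerate(zip(magnitudes, phases), start=1):
            if k * pitch_hz >= sample_rate / 2.0:
                break  # remaining harmonics lie above Nyquist
            sample += amp * math.cos(k * w0 * n + phi)
        frame.append(sample)
    return frame

# Two harmonics of a 100 Hz pitch, zero phase, two pitch periods at 8 kHz.
frame = synthesize_frame([1.0, 0.5], [0.0, 0.0], pitch_hz=100.0, num_samples=160)
```

With a 100 Hz pitch at 8 kHz, one pitch period spans 80 samples, so the synthesized waveform repeats every 80 samples.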

[1] William D. Voiers, et al. Modulated noise reference unit (MNRU) tests using DRT, DAM, and spelling alphabet test materials, 1990.

[2] Meir Tzur, et al. Low bit rate speech compression for playback in speech recognition systems, 2000, 2000 10th European Signal Processing Conference.

[3] John S. Collura, et al. MELP: the new Federal Standard at 2400 bps, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Thomas F. Quatieri, et al. Speech analysis/synthesis based on a sinusoidal representation, 1986, IEEE Trans. Acoust. Speech Signal Process.

[5] Meir Tzur, et al. Speech reconstruction from mel frequency cepstral coefficients and pitch frequency, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[6] Stephan Euler, et al. The influence of speech coding algorithms on automatic speech recognition, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] Kuldip K. Paliwal, et al. Effect of speech coders on speech recognition performance, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[8] Josef Psutka, et al. Pitch synchronous residual excited speech reconstruction on the MFCC, 2000, 2000 10th European Signal Processing Conference.

[9] Josef Psutka, et al. Speech production based on the mel-frequency cepstral coefficients, 1999, EUROSPEECH.