Efficient Noise Robust Feature Extraction Algorithms for Distributed Speech Recognition (DSR) Systems

The development of robust speech recognition systems that maintain high recognition accuracy in difficult and dynamically varying acoustic environments is becoming increasingly important as speech recognition technology becomes an integral part of mobile applications. In a distributed speech recognition (DSR) architecture, the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. The terminal performs feature parameter extraction (the front-end of the speech recognition system), and the resulting features are transmitted over a data channel to the remote back-end recogniser. For mobile-device applications, DSR offers particular benefits, such as better recognition performance than recognition over the voice channel and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into a DSR system must therefore operate in real time and at the lowest possible computational cost. In this paper, two innovative front-end processing techniques for noise-robust speech recognition are presented and compared: time-domain frame-attenuation (TD-FrAtt) and frequency-domain frame-attenuation (FD-FrAtt). These techniques combine different forms of frame-attenuation, an improved spectral subtraction based on minimum statistics, and a mel-cepstrum feature extraction procedure. Tests are performed with the HTK speech recognition toolkit on the Slovenian SpeechDat II fixed telephone database and the Aurora 2 database. The results are especially encouraging for mobile DSR systems with limited memory and processing power.
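The abstract names the processing stages but not their parameters. Purely as an illustration, the Python sketch here chains a simplified minimum-statistics noise estimate (fixed smoothing factor and constant bias correction, a simplification of Martin's method rather than the full algorithm), power spectral subtraction with over-subtraction and a spectral floor, and a mel-cepstrum computation. The frame and filterbank settings (8 kHz sampling, 25 ms frames with a 10 ms shift, 256-point FFT, 23 mel filters starting at 64 Hz, 13 cepstra) are assumptions chosen to resemble the ETSI DSR front-end, not taken from the paper. All function names and tuning constants (`alpha`, `win`, `bias`, `over`, `floor`) are hypothetical. The TD-FrAtt and FD-FrAtt frame-attenuation steps are not modelled, because the abstract does not say how they work.

```python
import numpy as np

# Hypothetical sketch of a noise-robust mel-cepstrum front-end.
# Assumed settings (8 kHz, 25/10 ms frames, 23 mel bands, 13 cepstra);
# this is NOT the paper's TD-FrAtt / FD-FrAtt method.

def frame_signal(x, frame_len=200, hop=80):
    """Split a 1-D signal into Hamming-windowed frames (25 ms / 10 ms at 8 kHz)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

def min_stats_noise(power, alpha=0.85, win=50, bias=1.5):
    """Noise PSD = bias * minimum of recursively smoothed power over the last `win` frames."""
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for t in range(1, len(power)):
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * power[t]
    noise = np.empty_like(power)
    for t in range(len(power)):
        noise[t] = smoothed[max(0, t - win + 1):t + 1].min(axis=0)
    return bias * noise

def spectral_subtract(power, noise, over=2.0, floor=0.01):
    """Power spectral subtraction with over-subtraction and a noise-relative floor."""
    return np.maximum(power - over * noise, floor * noise)

def mel_filterbank(n_filters=23, nfft=256, sr=8000, fmin=64.0):
    """Triangular filters equally spaced on the mel scale from fmin to sr/2."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):  # rising edge
            fb[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):  # falling edge
            fb[m - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def dct_ii(x, n_ceps=13):
    """Unnormalised DCT-II along the last axis, keeping the first n_ceps coefficients."""
    n = x.shape[-1]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n)[None, :] + 0.5) / n)
    return x @ basis.T

def robust_mfcc(x, sr=8000, nfft=256):
    """Frames -> power spectrum -> noise estimate -> subtraction -> log mel -> cepstrum."""
    power = np.abs(np.fft.rfft(frame_signal(x), n=nfft, axis=1)) ** 2
    clean = spectral_subtract(power, min_stats_noise(power))
    mel_energy = clean @ mel_filterbank(nfft=nfft, sr=sr).T
    return dct_ii(np.log(mel_energy + 1e-10))  # epsilon avoids log(0)

# Usage: one second of a 440 Hz tone in white noise at 8 kHz.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(8000)
print(robust_mfcc(noisy).shape)  # (98, 13)
```

Taking the minimum of the smoothed power over a sliding window tracks the noise floor without a voice activity detector, since speech pauses and low-energy bins pull the minimum towards the noise level; the constant `bias` roughly compensates for the minimum underestimating the mean noise power.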
