Acoustic synthesis of training data for speech recognition in living room environments

Despite continuous progress in robust automatic speech recognition, acoustic mismatch between training and test conditions remains a major problem. Consequently, large speech collections must be conducted in many different environments. An alternative approach is to generate training data synthetically by filtering clean speech with impulse responses and/or adding noise signals from the target domain. We compare the performance of a speech recognizer trained on speech recorded in the target domain with that of a system trained on suitably transformed clean speech. To obtain comparable results, our experiments are based on two-channel recordings with a close-talk and a distant microphone, which provide the clean signal and the target-domain signal, respectively. By filtering and adding noise we obtain error rates that are only 10% higher for natural number recognition and 30% higher for a command recognition task, compared to training with target-domain data.
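
A minimal sketch of the described synthesis step, assuming NumPy/SciPy: clean speech is convolved with a room impulse response and noise is added at a chosen signal-to-noise ratio. The function name, the `snr_db` parameter, and the array conventions are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_training_signal(clean, rir, noise, snr_db):
    """Simulate a distant-microphone recording from clean speech.

    clean : 1-D float array, close-talk speech signal
    rir   : 1-D float array, room impulse response (same sampling rate)
    noise : 1-D float array, target-domain noise, at least as long as `clean`
    snr_db: desired signal-to-noise ratio of the mixture, in dB
    (hypothetical interface for illustration)
    """
    # Filter the clean signal with the impulse response; truncate the
    # full convolution back to the original signal length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Scale the noise so the mixture reaches the target SNR.
    sig_power = np.mean(reverberant ** 2)
    noise = noise[: len(reverberant)]
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))

    return reverberant + scale * noise
```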
