Simulation of the Hands-free Speech Input to Speech Recognition Systems by Measuring Room Impulse Responses

This paper presents a hardware and software approach for measuring the room impulse response, which characterises the transmission of audio signals in a room. The approach was developed within the European SpeeCon project [1]. Graphical user interfaces have been designed to estimate the room impulse response from recordings of noise signals played back in the room, and to analyse the impulse response with respect to the reverberation time and the corresponding frequency response. The impulse response can be used to artificially create speech data that contain the influence of hands-free speech input in a room. We used such speech data to investigate the performance degradation of a speech recognition system under this acoustic input condition. A few exemplary results are presented.

1 Measurement of room impulse response

The goal of the European SpeeCon project [1] was the collection of speech data for different languages, with a focus on recording speech utterances in hands-free mode inside rooms. These data are intended to support the development of recognition systems that allow, for example, the control of electronic devices by speech input in hands-free mode. To measure the acoustic condition in each individual recording session, the hardware set-up shown in figure 1 was developed. The aim is to estimate the room impulse response, which describes the transmission of an audio signal in a room. With the impulse response it is possible to analyse the acoustic condition of each recording session individually. Furthermore, speech data can be artificially created that contain the effect of hands-free speech input in this specific situation.

Pink noise and a maximum length sequence (MLS) are played back from a CD player via a loudspeaker. Instead of the commonly used white noise, pink noise, whose energy decreases towards higher frequencies, is used to compensate for the frequency characteristics of the small loudspeaker.
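To illustrate what "energy decreasing towards higher frequencies" means for the excitation signal, the following is a minimal sketch of one common way to generate pink noise: white Gaussian noise is shaped in the frequency domain so that its power falls off as 1/f. This is not the signal used on the project CD. The function name `pink_noise`, the seed, and the normalisation are assumptions made for illustration, and the default of 16 kHz is simply the recording rate mentioned in the text.

```python
import numpy as np

def pink_noise(n_samples, fs=16000, seed=0):
    """Hypothetical helper: shape white noise so that power is proportional to 1/f."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)

    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)

    # Amplitude weighting 1/sqrt(f) gives a power weighting of 1/f.
    scale = np.zeros_like(freqs)          # the DC bin stays at zero
    scale[1:] = 1.0 / np.sqrt(freqs[1:])

    pink = np.fft.irfft(spectrum * scale, n=n_samples)
    return pink / np.max(np.abs(pink))    # normalise the peak to 1

noise = pink_noise(16000)  # one second of pink noise at 16 kHz
```

With this weighting every octave band carries roughly the same power, whereas white noise doubles its power from one octave to the next.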
The noise signals are recorded with two microphones and stored on a PC as digital signals at a sampling rate of 16 kHz. One microphone (M1) is placed close to the loudspeaker. The second microphone (M2) is placed at the position in the room where the impulse response is to be measured. An impulse response could be estimated by comparing the signal recorded at microphone M2 with the noise signal as stored on the CD. In that case, however, the estimated impulse response would also include the transmission characteristics of the loudspeaker. This is avoided by the second recording close to the loudspeaker: the two microphone signals are used to estimate the transmission characteristics between the microphones.

We apply two approaches to determine the impulse response, using either the recordings of the pink noise or the recorded MLS. In the case of the pink noise, we estimate the power density spectrum of each of the two microphone signals. For example, the Welch method can be applied, in which the noise signal is split into segments, the spectrum of each segment is calculated with a DFT, and the power density spectrum is obtained by averaging the power spectra over all segments. The ratio of the power density spectrum of microphone M2 to the corresponding spectrum of M1 yields an estimate of the room transfer function, more precisely of its squared magnitude.

An MLS has the interesting property that its autocorrelation function is approximately a Dirac impulse. This holds especially for long sequences. We use an MLS of length 16383 (2^14 - 1). The signal recorded by microphone M2 can be described as the convolution of the signal recorded by microphone M1 with the room impulse response.

Figure 1: Hardware set-up to measure the impulse response of a room
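The two estimation approaches described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SpeeCon software: the "room" is a hypothetical three-tap impulse response `h_true`, white Gaussian noise stands in for the recorded pink noise, the M1 signal is taken to be the ideal excitation, and a short MLS of length 1023 (recurrence a[k] = a[k-3] XOR a[k-10]) replaces the 16383-sample sequence to keep the example small. All function names are invented for this sketch. The convolution that produces the simulated M2 signal is the same operation that would be used to create artificial hands-free speech data from close-talking recordings.

```python
import numpy as np

def welch_psd(x, nperseg=512):
    """Average |DFT|^2 over half-overlapping, Hann-windowed segments.
    Scaling constants are omitted because they cancel in the M2/M1 ratio."""
    window = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, nperseg // 2)
    segments = np.array([x[s:s + nperseg] * window for s in starts])
    return np.mean(np.abs(np.fft.rfft(segments, axis=1)) ** 2, axis=0)

def transfer_magnitude_sq(m1, m2, nperseg=512):
    """Ratio of the M2 and M1 power spectra: an estimate of |H(f)|^2."""
    return welch_psd(m2, nperseg) / welch_psd(m1, nperseg)

def mls(nbits=10, taps=(3, 10)):
    """One period (2**nbits - 1 samples) of a +1/-1 maximum length sequence."""
    bits = [1] * nbits                      # any non-zero start state works
    for k in range(nbits, 2 ** nbits - 1):
        bits.append(bits[k - taps[0]] ^ bits[k - taps[1]])
    return 1.0 - 2.0 * np.array(bits)       # map 0 -> +1, 1 -> -1

def impulse_response_from_mls(seq, recorded_period):
    """Circular cross-correlation of one recorded period with the MLS.
    Because the MLS autocorrelation is nearly a Dirac impulse, the result
    approximates the impulse response up to a small constant offset."""
    spectrum = np.fft.fft(recorded_period) * np.conj(np.fft.fft(seq))
    return np.fft.ifft(spectrum).real / (len(seq) + 1)

# Hypothetical toy room: direct sound plus two reflections.
h_true = np.zeros(8)
h_true[[0, 3, 7]] = [1.0, 0.5, 0.25]

# Approach 1: noise excitation, ratio of averaged power spectra.
rng = np.random.default_rng(0)
m1_noise = rng.standard_normal(32000)       # stand-in for the M1 recording
m2_noise = np.convolve(m1_noise, h_true)[:len(m1_noise)]
h_mag_sq = transfer_magnitude_sq(m1_noise, m2_noise)

# Approach 2: periodic MLS excitation, cross-correlation over one period.
seq = mls()
m2_mls = np.convolve(np.tile(seq, 2), h_true)[len(seq):2 * len(seq)]
h_est = impulse_response_from_mls(seq, m2_mls)[:len(h_true)]
```

Note that the spectral ratio only recovers the magnitude response, whereas the MLS cross-correlation returns the impulse response itself, including the arrival times of the reflections. In a real measurement the M1 recording is not the ideal excitation, so the MLS step would need to account for the loudspeaker characteristics contained in both microphone signals.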

[1] R. G. Leonard, et al. A database for speaker-independent digit recognition, 1984, ICASSP.

[2] Hans-Günter Hirsch, et al. The simulation of realistic acoustic input scenarios for speech recognition systems, 2005, INTERSPEECH.

[3] H. Sabine. Room Acoustics, 1953, The SAGE Encyclopedia of Human Communication Sciences and Disorders.

[4] B. Cranen, et al. Automatic Speech Recognition in Adverse Acoustic Conditions, 1999.