Likelihood-maximizing beamforming for robust hands-free speech recognition

Speech recognition performance degrades significantly in distant-talking environments, where the speech signal can be severely distorted by additive noise and reverberation. Microphone arrays have been proposed as a means of improving the quality of speech captured in such environments. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing followed by recognition. Array processing algorithms designed for signal enhancement are applied to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily improve recognition performance, and it ignores the manner in which speech recognition systems operate. In this paper, a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features that maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
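
As a rough illustration of the filter-and-sum structure mentioned above, the sketch below filters each microphone channel with its own FIR filter and sums the results. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `filter_and_sum`, the plain-list signal representation, and the hand-picked taps are all hypothetical. In likelihood-maximizing beamforming the taps would instead be chosen to maximize the recognizer's likelihood, a step this example does not attempt.

```python
def filter_and_sum(channels, filters):
    """Filter each microphone channel with its own FIR filter, then sum.

    channels: list of M equal-length lists of samples (one per microphone).
    filters:  list of M lists of FIR taps (one filter per microphone).
    Returns a list with the same length as each input channel.
    (Hypothetical helper for illustration only.)
    """
    if len(channels) != len(filters):
        raise ValueError("need exactly one filter per channel")
    n_samples = len(channels[0])
    if any(len(ch) != n_samples for ch in channels):
        raise ValueError("all channels must have the same length")
    output = [0.0] * n_samples
    for x, h in zip(channels, filters):
        for n in range(n_samples):
            # Causal FIR convolution: y[n] += sum over p of h[p] * x[n - p]
            for p, tap in enumerate(h):
                if n - p >= 0:
                    output[n] += tap * x[n - p]
    return output


# Delay-and-sum is the special case where each filter is a scaled, shifted impulse.
mic0 = [0.0, 1.0, 2.0, 3.0, 0.0]  # the source reaches mic 0 one sample late
mic1 = [1.0, 2.0, 3.0, 0.0, 0.0]  # the same source reaches mic 1 first
taps = [[0.5], [0.0, 0.5]]        # delay mic 1 by one sample, then average
aligned = filter_and_sum([mic0, mic1], taps)
```

With these assumed taps the two channels are time-aligned before averaging, so `aligned` reproduces the source waveform as observed at mic 0. Longer filters per channel give the beamformer more freedom than pure delays, which is what makes the taps a useful set of parameters to optimize.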
