An Analysis of Automatic Speech Recognition with Multiple Microphones

Automatic speech recognition in real-world situations often requires microphones placed at a distance from the speaker's mouth. One or several microphones are placed in the surroundings to capture multiple versions of the original signal. Recognition with a single far-field microphone yields considerably poorer performance than with person-mounted devices (headset, lapel), the main causes being reverberation and noise. Acoustic beamforming techniques allow significant improvements over the use of a single microphone, although the overall performance still falls well short of close-talking results. In this paper we investigate the use of beamforming in the context of speaker movement, together with commonly used adaptation techniques, and compare it against a naive multi-stream approach. We show that even such a simple approach can yield results equivalent to beamforming, allowing for far more powerful integration of multiple microphone sources in ASR systems.
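
As a rough sketch (not the paper's implementation), the contrast between the two strategies can be pictured as follows. Delay-and-sum beamforming fuses the distant-microphone channels in the signal domain before recognition, while a naive multi-stream approach decodes each channel separately and combines the per-frame acoustic scores. The function names, the cross-correlation TDOA estimate, and the equal stream weights below are all illustrative assumptions:

    import numpy as np

    def delay_and_sum(channels, ref=0):
        """Minimal delay-and-sum beamformer: align each channel to a
        reference via a cross-correlation delay estimate, then average."""
        ref_sig = channels[ref]
        out = np.zeros_like(ref_sig, dtype=float)
        for sig in channels:
            # Estimate the relative delay from the peak of the
            # full cross-correlation with the reference channel.
            xcorr = np.correlate(sig, ref_sig, mode="full")
            delay = int(np.argmax(xcorr)) - (len(ref_sig) - 1)
            # Shift the channel back into alignment (circular shift is a
            # simplification; a real system would pad or use subsample shifts).
            out += np.roll(sig, -delay)
        return out / len(channels)

    def multistream_scores(per_channel_loglikes):
        """Naive multi-stream combination: average per-frame acoustic
        log-likelihoods across channels, with equal stream weights."""
        # per_channel_loglikes: list of (frames, states) arrays, one per mic.
        return np.mean(np.stack(per_channel_loglikes), axis=0)

The design difference is the fusion point: the beamformer combines channels before feature extraction, whereas the multi-stream combiner fuses after acoustic scoring, which is what makes it a more flexible starting point for integrating heterogeneous microphone sources.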
