Microphone Array Processing Strategies for Distant-Based Automatic Speech Recognition

Robust distant speech recognition (DSR) is necessary in many speech technology applications that use multiple microphones, but it has received only limited treatment in the literature. In this paper, we address communication with voice-controlled in-vehicle systems, which is one application of DSR. Two approaches to DSR are (i) signal-level combination, using beamforming followed by automatic speech recognition (ASR), and (ii) word hypothesis-level combination, using several speech recognition engines followed by either confusion network combination or recognizer output voting error reduction (ROVER). In addition to these approaches, it is possible to examine training-level combination, in which the recognizer is trained on audio signals from multiple channels (microphones). We investigate how these methods can be leveraged for in-vehicle ASR using the CU-Move corpus, and we propose various combinations of the three methods to find an optimum structure for in-vehicle ASR. We also investigate the effect of speaker adaptation (SA). Our experience shows that applying SA on individual channels and merging the results with ROVER reduces the negative effects of SA reported by others in the field, and illustrates the overall improvement obtained with front-end enhancement techniques in DSR.
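
As a rough illustration of word hypothesis-level combination, the sketch below takes a majority vote across per-channel recognizer outputs. It is a simplified stand-in and not the paper's implementation: full ROVER first aligns the hypotheses into a word transition network by dynamic programming and can also weigh confidence scores, whereas this sketch assumes the hypotheses are already aligned slot by slot. The function name `vote_aligned` and the empty-slot marker `@` are hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical marker for an empty slot in an alignment (assumption for this sketch).
NULL = "@"


def vote_aligned(hypotheses):
    """Majority vote over word sequences that are already aligned slot by slot.

    Each hypothesis is a list of words, one per alignment slot, with NULL
    marking a slot where that channel produced no word. Ties are resolved
    in favor of the word seen first, i.e. the earliest channel in the list.
    """
    if not hypotheses:
        return []
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("hypotheses must be pre-aligned to the same length")

    result = []
    for slot in zip(*hypotheses):
        # Pick the most frequent entry in this slot across all channels.
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            result.append(word)
    return result


if __name__ == "__main__":
    channels = [
        ["turn", "left", "at", NULL],
        ["turn", "let", "at", "main"],
        ["turn", "left", "at", "main"],
    ]
    print(" ".join(vote_aligned(channels)))
```

In this toy example each channel makes a different error (a deletion on the first, a substitution on the second), and the vote recovers a sequence that no single erroneous channel produced on its own, which is the intuition behind merging per-channel results.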
