A Two-Channel Acoustic Front-End for Robust Automatic Speech Recognition in Noisy and Reverberant Environments

An acoustic front-end for robust automatic speech recognition in noisy and reverberant environments is proposed in this contribution. It comprises a blind source separation-based signal extraction scheme and requires only two microphone signals. The proposed front-end and its integration into the recognition system are analyzed and evaluated in noisy living-room-like environments according to the PASCAL CHiME challenge. The results show that the introduced system significantly improves recognition performance compared to the challenge baseline.
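To illustrate the kind of two-channel blind source separation underlying such a front-end, the following Python sketch shows a deliberately simplified, hypothetical example: it is not the algorithm of this paper, but a per-frequency-bin natural-gradient ICA with two microphones, with the scaling ambiguity resolved by projecting each output back to the first microphone. The frame size, step size, iteration count, and the omission of any frequency-permutation alignment (which broadband second-order-statistics BSS approaches handle inherently) are assumptions made for brevity.

import numpy as np
from scipy.signal import stft, istft


def separate_two_channel(x1, x2, fs=16000, n_fft=1024, mu=0.1, n_iter=50):
    """Blindly separate two sources from two microphone signals (illustrative sketch only)."""
    # STFT of both microphone channels; each is (n_bins, n_frames)
    _, _, X1 = stft(x1, fs=fs, nperseg=n_fft)
    _, _, X2 = stft(x2, fs=fs, nperseg=n_fft)
    X = np.stack([X1, X2], axis=0)                 # (2, n_bins, n_frames)
    n_bins, n_frames = X.shape[1], X.shape[2]

    Y = np.empty_like(X)
    for k in range(n_bins):                        # independent 2x2 ICA problem per bin
        Xk = X[:, k, :]                            # (2, n_frames)
        W = np.eye(2, dtype=complex)               # unmixing matrix for this bin
        for _ in range(n_iter):
            Yk = W @ Xk
            phi = Yk / (np.abs(Yk) + 1e-9)         # sign nonlinearity for complex speech spectra
            R = (phi @ Yk.conj().T) / n_frames
            W = W + mu * (np.eye(2) - R) @ W       # natural-gradient ICA update
        Yk = W @ Xk
        A = np.linalg.inv(W)
        # Resolve the scaling ambiguity by projecting each output back to microphone 1.
        Y[:, k, :] = A[0, :].reshape(2, 1) * Yk
        # NOTE: the per-bin permutation ambiguity is deliberately left unresolved here.

    # Back to the time domain
    _, y1 = istft(Y[0], fs=fs, nperseg=n_fft)
    _, y2 = istft(Y[1], fs=fs, nperseg=n_fft)
    return y1, y2


if __name__ == "__main__":
    # Toy usage with synthetic signals standing in for two microphone recordings.
    rng = np.random.default_rng(0)
    mic1 = rng.standard_normal(16000)
    mic2 = rng.standard_normal(16000)
    out1, out2 = separate_two_channel(mic1, mic2)
    print(out1.shape, out2.shape)

In a real front-end for ASR, the separated target signal would be passed on to feature extraction, and the interference estimate could drive a postfilter; this sketch only demonstrates the two-channel separation step itself.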
