Blind source extraction for robust speech recognition in multisource noisy environments

This paper describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target source signal so that spoken commands uttered in reverberant, noisy environments and acquired by a microphone array can be recognized. The BSE architecture consists of four stages: (a) TDOA estimation, (b) mixing system identification for the target source, (c) on-line semi-blind source separation, and (d) source extraction. The stages are combined so that the target signal is estimated with limited distortion. Although a generalization of the BSE framework is described, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e., binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE, and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients. In addition, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and to improve overall performance. A detailed comparison of different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system significantly reduces the error rate.
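To illustrate stage (a), here is a minimal sketch of TDOA estimation between two microphone channels. The abstract does not say which estimator the system uses, so this sketch substitutes plain time-domain cross-correlation as a stand-in. The function name `estimate_tdoa`, the `max_lag` search range, and the synthetic white-noise signals are all assumptions made for illustration only.

```python
import random


def estimate_tdoa(x1, x2, max_lag):
    """Return the integer lag (in samples) by which x2 trails x1.

    Illustrative stand-in only: picks the lag that maximizes the
    time-domain cross-correlation between the two channels.
    """
    n = len(x1)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                score += x1[t] * x2[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic two-channel example (assumed data, not from the CHiME corpus):
# the right channel is the left channel delayed by a known number of samples.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
true_delay = 5
left = source
right = [0.0] * true_delay + source[:-true_delay]

estimated = estimate_tdoa(left, right, max_lag=20)
print(estimated)
```

Dividing the estimated lag by the sampling rate gives the delay in seconds. In reverberant, multisource recordings a plain cross-correlation peak is often ambiguous, which is one reason more robust estimators are used in practice.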
