A Multichannel MMSE-Based Framework for Speech Source Separation and Noise Reduction

We propose a new framework for joint multichannel speech source separation and acoustic noise reduction. We begin by formulating the minimum mean-square error (MMSE)-based solution in the context of multiple simultaneous speakers and background noise, and highlight the importance of estimating the speakers' activities. The latter is accurately achieved by introducing a latent variable that takes N+1 possible discrete states for a mixture of N speech signals plus additive noise, where each state characterizes the dominance of one of the N+1 signals. We determine the posterior probability of this latent variable and show that it plays a twofold role in MMSE-based speech enhancement. First, it allows the second-order statistics of the noise and of each speech signal to be extracted from the noisy data; these statistics are needed to formulate the multichannel Wiener-based filters (including the minimum variance distortionless response). Second, it weights the outputs of these linear filters to shape the spectral content of the signal estimates according to the associated target speakers' activities. To compute the posterior probability of this latent variable, we use the spatial and spectral cues contained in the multichannel recordings of the sound mixtures. The spatial cue is acquired from the normalized observation vector, whose distribution is well approximated by a Gaussian-mixture-like model, while the spectral cue is captured by a pre-trained Gaussian mixture model of the log-spectra of speech. The parameters of the investigated models and the speakers' activities (the posterior probabilities of the different states of the latent variable) are estimated via expectation maximization. Experimental results, including comparisons with the well-known independent component analysis and time-frequency masking approaches, demonstrate the effectiveness of the proposed framework.
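The twofold role of the posterior described above can be sketched numerically: the posteriors first weight the outer products of the observations to estimate per-source and noise spatial covariances, and then re-weight (mask) the multichannel Wiener filter outputs. The sketch below, in Python/NumPy, illustrates this at a single frequency bin under simplifying assumptions; the observations and posteriors are randomly generated stand-ins for the STFT data and EM-derived speaker activities of the paper, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: M microphones, N speakers, so the latent variable has N+1 states
# (states 0..N-1 for speaker dominance, state N for noise dominance).
M, N, T = 4, 2, 500

# Simulated multichannel STFT observations at one frequency bin (M x T).
y = (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))) / np.sqrt(2)

# Stand-in posterior probabilities p[k, t] of the latent state at each frame.
# In the framework these come from spatial/spectral cues via EM; here we just
# draw a random soft assignment that sums to one over the N+1 states.
logits = rng.standard_normal((N + 1, T))
p = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

def weighted_cov(y, w):
    """Posterior-weighted spatial covariance: sum_t w[t] y_t y_t^H / sum_t w[t]."""
    return (w * y) @ y.conj().T / w.sum()

# Role 1: extract second-order statistics of the noise and of each speaker
# from the noisy data, using the posteriors as frame weights.
R_noise = weighted_cov(y, p[N])
R_src = [weighted_cov(y, p[k]) for k in range(N)]

# Role 2: form a multichannel Wiener filter per speaker and re-weight its
# output by that speaker's posterior (a soft time-frequency mask).
estimates = []
for k in range(N):
    W = R_src[k] @ np.linalg.inv(R_src[k] + R_noise)  # MWF: R_s (R_s + R_n)^-1
    s_hat = p[k] * (W @ y)                            # posterior-weighted output
    estimates.append(s_hat)
```

In a full implementation this computation would be repeated independently per frequency bin, with the posteriors shared across bins through the spatial and spectral models rather than drawn at random.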
