Maximum a Posteriori Binary Mask Estimation for Underdetermined Source Separation Using Smoothed Posteriors

Sound source separation has become a topic of intensive research in recent years. The research effort has been especially relevant for the underdetermined case, where a considerable number of sparse methods working in the time-frequency (T-F) domain have appeared. In this context, although binary masking is often the preferred choice for source demixing, the estimated masks differ substantially from the ideal ones. This paper proposes a maximum a posteriori (MAP) framework for binary mask estimation. To this end, the class-conditional source probabilities of the observed mixing parameters are modeled via ratios of dependent Cauchy distributions, while the source priors are calculated iteratively from the observed histograms. Moreover, spatially smoothed posteriors in the T-F domain are proposed to avoid noisy estimates, and the resulting masks are shown to be closer to the ideal ones in terms of objective performance measures.
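
The pipeline outlined above (class-conditional likelihoods, iteratively estimated priors, smoothed posteriors, MAP decision) can be illustrated with a minimal sketch. It is not the implementation from the paper and rests on several assumptions: the likelihoods are taken as a precomputed array rather than derived from the Cauchy-ratio model, the priors are re-estimated with a simple EM-style average of the posteriors rather than from histograms, and the smoothing is a plain box filter. The names `box_smooth`, `posteriors`, and `map_binary_masks` are hypothetical.

```python
import numpy as np


def box_smooth(p, radius=1):
    """Average each T-F bin with its neighbours (edge-padded box filter).

    Assumption: a box filter stands in for whatever smoothing the paper uses.
    """
    k = 2 * radius + 1
    padded = np.pad(p, radius, mode="edge")
    out = np.zeros(p.shape, dtype=float)
    for df in range(k):
        for dt in range(k):
            out += padded[df:df + p.shape[0], dt:dt + p.shape[1]]
    return out / (k * k)


def posteriors(likelihoods, priors):
    """Bayes' rule per T-F bin: P(source | x) proportional to prior * p(x | source)."""
    joint = priors[:, None, None] * likelihoods
    norm = np.maximum(joint.sum(axis=0, keepdims=True), 1e-300)  # avoid 0/0
    return joint / norm


def map_binary_masks(likelihoods, n_iter=10, radius=1):
    """Estimate one binary mask per source.

    likelihoods: array of shape (n_sources, n_freq, n_frames) holding
                 p(x | source) for every T-F bin (assumed precomputed).
    Returns (masks, priors); masks is a boolean array of the same shape.
    """
    n_src = likelihoods.shape[0]
    priors = np.full(n_src, 1.0 / n_src)  # start from uniform priors
    for _ in range(n_iter):
        # Assumed EM-style update: new prior = mean posterior over all bins.
        priors = posteriors(likelihoods, priors).mean(axis=(1, 2))
    post = posteriors(likelihoods, priors)
    # Smooth each source's posterior map before the MAP decision.
    smoothed = np.stack([box_smooth(p, radius) for p in post])
    labels = smoothed.argmax(axis=0)
    masks = labels[None] == np.arange(n_src)[:, None, None]
    return masks, priors


# Toy example: two sources, each dominating one half of a 4x4 T-F grid.
lik = np.empty((2, 4, 4))
lik[0, :, :2], lik[0, :, 2:] = 0.9, 0.1
lik[1] = 1.0 - lik[0]
masks, priors = map_binary_masks(lik)
```

Because the decision is an argmax over sources, every T-F bin is assigned to exactly one source, which is what makes the masks binary. Smoothing the posteriors rather than the final masks lets isolated low-confidence bins be outvoted by their neighbours before the hard decision is taken.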
