Binary mask estimation for voiced speech segregation using Bayesian method

Estimating the ideal binary mask (IBM) has been established as the computational goal of computational auditory scene analysis (CASA), and considerable effort has gone into estimating it with statistical learning methods. Existing Bayesian methods usually estimate the mask value of each time-frequency (T-F) unit independently, using only local auditory features. In this paper, we propose a new Bayesian approach. First, a set of pitch-based auditory features is assembled to exploit the inherent characteristics of reliable and unreliable T-F units, and a rough estimate is obtained with the maximum likelihood (ML) rule. Then, we propose a prior model derived from onset/offset segmentation to improve the estimate. Finally, an efficient Markov chain Monte Carlo (MCMC) procedure is applied to approximate the maximum a posteriori (MAP) estimate. The proposed method is evaluated on Cooke's 100 mixtures and compared with a previous model. Experiments show that our method performs better.
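The three stages above (per-unit ML decision, a prior that couples T-F units, MCMC search for the MAP mask) can be illustrated with a minimal sketch. It is not the paper's implementation. The onset/offset-derived prior is replaced here by a generic Ising-style smoothness prior over the four neighbouring T-F units, and the per-unit log-likelihoods are assumed to come from some feature model that is not shown. The names `ml_mask`, `map_mask_gibbs`, `beta`, `sweeps`, and the annealing schedule are all hypothetical choices made for illustration.

```python
import math
import random


def ml_mask(loglik_target, loglik_interf):
    """Per-unit ML decision: label a T-F unit 1 where the target log-likelihood is higher."""
    return [[1 if t > i else 0 for t, i in zip(row_t, row_i)]
            for row_t, row_i in zip(loglik_target, loglik_interf)]


def map_mask_gibbs(loglik_target, loglik_interf, beta=1.0, sweeps=50, seed=0):
    """Approximate the MAP mask by annealed Gibbs sampling.

    Assumption: an Ising-style prior rewards agreement with the four
    neighbouring units; this stands in for the segmentation-based prior.
    """
    rng = random.Random(seed)
    mask = ml_mask(loglik_target, loglik_interf)  # start from the rough ML estimate
    n_ch, n_fr = len(mask), len(mask[0])
    for sweep in range(sweeps):
        # Linear cooling with a floor, so late sweeps are nearly greedy.
        temp = max(0.05, 1.0 - sweep / sweeps)
        for c in range(n_ch):
            for m in range(n_fr):
                ones = zeros = 0
                for cc, mm in ((c - 1, m), (c + 1, m), (c, m - 1), (c, m + 1)):
                    if 0 <= cc < n_ch and 0 <= mm < n_fr:
                        if mask[cc][mm] == 1:
                            ones += 1
                        else:
                            zeros += 1
                # Log posterior odds of label 1 versus label 0 for this unit.
                log_odds = (loglik_target[c][m] - loglik_interf[c][m]) + beta * (ones - zeros)
                x = log_odds / temp
                if x > 30.0:        # clamp to avoid overflow in exp
                    p_one = 1.0
                elif x < -30.0:
                    p_one = 0.0
                else:
                    p_one = 1.0 / (1.0 + math.exp(-x))
                mask[c][m] = 1 if rng.random() < p_one else 0
    return mask
```

With a 5x5 grid in which every unit favours the target except one weakly contradicting unit in the centre, the ML mask leaves that unit at 0, while the coupled estimate is expected to relabel it as 1 because all of its neighbours are confidently target-dominated. This is the kind of correction that an independent per-unit classifier cannot make.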
