An iterative model-based approach to cochannel speech separation

Cochannel speech separation aims to separate two speech signals from a single mixture. In a supervised scenario, the identities of the two speakers are given, and current methods use pre-trained speaker models for separation. One issue in model-based methods is the mismatch between training and test signal levels. We propose an iterative algorithm that adapts speaker models to match the signal levels in testing. Our algorithm first obtains initial estimates of the source signals using unadapted speaker models and then detects the input signal-to-noise ratio (SNR) of the mixture. The detected input SNR is then used to adapt the speaker models for more accurate estimation. The two steps iterate until convergence. Unlike search-based SNR detection methods, our method is not limited to a predefined set of SNR levels. Evaluations demonstrate that the iterative procedure converges quickly over a considerable range of input SNRs and improves separation results significantly. Comparisons show that the proposed system performs significantly better than related model-based systems.
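The two alternating steps in the abstract (separate with the current models, detect the input SNR, re-adapt the model gain, repeat) can be sketched as a fixed-point iteration. The code below is a minimal toy illustration, not the paper's actual system: the speaker "models" are stand-in fixed magnitude spectra, separation is a simple soft mask, and the function names (`soft_mask_separate`, `detect_snr_db`, `iterative_separate`) are hypothetical.

```python
import numpy as np

def soft_mask_separate(mix_spec, model_a, model_b, gain):
    """Toy stand-in for model-based estimation: soft masking with a
    gain-adapted interferer model."""
    denom = model_a + gain * model_b + 1e-12
    est_a = mix_spec * model_a / denom
    est_b = mix_spec * gain * model_b / denom
    return est_a, est_b

def detect_snr_db(est_a, est_b):
    """Detect the input SNR from the current source estimates."""
    return 10.0 * np.log10(np.sum(est_a**2) / (np.sum(est_b**2) + 1e-12))

def iterative_separate(mix_spec, model_a, model_b, max_iter=20, tol=1e-3):
    gain = 1.0  # start unadapted: assume equal signal levels
    for _ in range(max_iter):
        # Step 1: estimate sources with the current (gain-adapted) models
        est_a, est_b = soft_mask_separate(mix_spec, model_a, model_b, gain)
        # Step 2: detect input SNR and re-adapt the interferer gain
        snr_db = detect_snr_db(est_a, est_b)
        new_gain = 10.0 ** (-snr_db / 20.0)  # interferer level re: target
        if abs(new_gain - gain) < tol:       # converged
            break
        gain = new_gain
    return est_a, est_b, snr_db

# Toy mixture: disjoint-support "models", target 6 dB above interferer.
model_a = np.array([1.0, 1.0, 0.0, 0.0])
model_b = np.array([0.0, 0.0, 1.0, 1.0])
mixture = model_a + 0.5 * model_b
est_a, est_b, snr_db = iterative_separate(mixture, model_a, model_b)
print(round(snr_db, 2))  # ≈ 6.02 dB for this toy mixture
```

Because the detected SNR is a continuous quantity, the adapted gain is not restricted to a discrete grid of candidate levels, which is the contrast the abstract draws with search-based SNR detection.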
