Target speech extraction with learned spectral bases

In this paper we present a method for extracting the speech signal of a target speaker from noisy convolutive mixtures of the target speech and an interference source, given training utterances of the target speaker. We incorporate a statistical latent variable model into blind source separation (BSS): spectral bases learned from the training utterances are used to identify which separated source corresponds to the target speaker. Our main contribution is a post-processing stage that can be combined with any existing BSS method and consists of two steps: (1) channel selection, where we identify the source corresponding to the target speaker; and (2) enhancement, where we further suppress the remaining interference. Numerical experiments confirm that our method substantially improves the separation quality of existing BSS methods and successfully restores the target speaker's speech.
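The channel selection step can be illustrated with a minimal sketch. It is not the model used in the paper: as a stand-in for the statistical latent variable model, it assumes nonnegative matrix factorization (NMF) with a KL-divergence objective, and it assumes each BSS output is available as a magnitude spectrogram (frequency bins by frames). The function names `learn_bases`, `fit_error`, and `select_channel`, and all parameter values, are hypothetical. The idea shown is only that bases learned from the target speaker's training utterances should reconstruct the target channel better than the interference channel.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def kl_div(V, R):
    """Generalized KL divergence between nonnegative arrays V and R."""
    return float(np.sum(V * np.log((V + EPS) / (R + EPS)) - V + R))


def learn_bases(V, n_bases, n_iter=200, seed=0):
    """Learn spectral bases W (freq x n_bases) from a training magnitude
    spectrogram V (freq x frames) with KL-NMF multiplicative updates.
    Assumption: NMF stands in for the paper's latent variable model."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_bases)) + EPS
    H = rng.random((n_bases, V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + EPS))) / (W.sum(axis=0)[:, None] + EPS)
        W *= ((V / (W @ H + EPS)) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    # Normalize each basis to unit sum; the activations are discarded.
    return W / (W.sum(axis=0, keepdims=True) + EPS)


def fit_error(V, W, n_iter=200, seed=0):
    """Fit activations to V with the bases W held fixed and return the
    KL reconstruction error, normalized by the total energy of V so that
    channels with different gains are comparable."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + EPS))) / (W.sum(axis=0)[:, None] + EPS)
    return kl_div(V, W @ H) / (V.sum() + EPS)


def select_channel(channels, W):
    """Return the index of the BSS output best explained by the target
    speaker's bases, together with the per-channel errors."""
    errors = [fit_error(V, W) for V in channels]
    return int(np.argmin(errors)), errors
```

In use, `learn_bases` would be run once on the training utterances, and `select_channel` would be applied to the magnitude spectrograms of the outputs of whichever BSS method is used. The enhancement step is not covered by this sketch.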
