Self-adaption in single-channel source separation

Single-channel source separation (SCSS) usually relies on pre-trained source-specific models to separate the sources. These models capture the characteristics of each source and perform well when the training and test conditions match. In this paper, we extend the applicability of SCSS: we develop an EM-like iterative adaptation algorithm that is capable of adapting the pre-trained models to the changed characteristics of a specific situation, such as a different acoustic channel introduced by variations in room acoustics or a changed speaker position. The adaptation framework requires signal mixtures only, i.e., no isolated single-source signals are necessary. We consider speech/noise mixtures and restrict the adaptation to the speech model. Model adaptation is empirically evaluated on mixture utterances from the CHiME 2 challenge. We perform experiments with speaker-dependent (SD) and speaker-independent (SI) models trained on clean or reverberated single-speaker utterances. We successfully adapt SI source models trained on clean utterances and achieve almost the same performance level as SD models trained on reverberated utterances.
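To make the EM-like adaptation idea concrete, the sketch below shows a heavily simplified toy version, not the paper's algorithm: a pre-trained speech model (here a small GMM in the log-spectral domain) is adapted to an unknown acoustic channel, which acts as an additive offset in the log domain, using EM on observations alone. The noise-free setting, model sizes, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pre-trained speech model: GMM in the log-spectral domain
# (K components, D frequency bins, shared spherical variance).
K, D, T = 4, 8, 500
means = rng.normal(0.0, 2.0, size=(K, D))   # component means
weights = np.full(K, 1.0 / K)               # component priors
var = 0.5                                   # shared spherical variance

# Simulate mismatched test data: frames drawn from the model, shifted by an
# unknown channel offset h_true (convolutive channel -> additive in log domain).
h_true = rng.normal(0.0, 1.0, size=D)
comp = rng.integers(0, K, size=T)
X = means[comp] + h_true + rng.normal(0.0, np.sqrt(var), size=(T, D))

# EM-like adaptation of the channel offset h from the observations only;
# the pre-trained means and weights stay fixed.
h = np.zeros(D)
for it in range(30):
    # E-step: posterior over components under the current channel estimate.
    diff = X[:, None, :] - (means + h)[None, :, :]            # (T, K, D)
    logp = -0.5 * (diff ** 2).sum(-1) / var + np.log(weights)
    logp -= logp.max(axis=1, keepdims=True)                   # stabilize exp
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)                   # (T, K)
    # M-step: for a shared spherical variance, the optimal h is the
    # posterior-weighted mean residual between observations and means.
    h = (post[:, :, None] * (X[:, None, :] - means[None])).sum((0, 1)) / T

print("max channel estimation error:", np.abs(h - h_true).max())
```

The same E-step/M-step alternation generalizes to the mixture case, where the E-step must additionally infer which source dominates each time-frequency bin before updating the speech-model parameters.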