A variational EM algorithm for learning eigenvoice parameters in mixed signals

We derive an efficient learning algorithm for model-based source separation of single-channel speech mixtures in which the precise source characteristics are not known a priori. Each source is modeled with a factor-analyzed hidden Markov model (HMM) whose source-specific characteristics are captured by an “eigenvoice” speaker-subspace model. The proposed algorithm learns adaptation parameters for two speech sources when only the mixed signal is observed. We evaluate the algorithm on the 2006 Speech Separation Challenge data set and show that it is significantly faster than our earlier system, at a small cost in separation performance.
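For orientation, the standard eigenvoice parameterization behind the speaker-subspace model is sketched below; this follows the usual rapid eigenvoice adaptation setup and is not necessarily the paper's exact formulation. The Gaussian means of each source HMM are stacked into a supervector constrained to a low-dimensional affine subspace,

\mu_s \;=\; \bar{\mu} + \mathbf{U}\,\mathbf{w}_s \;=\; \bar{\mu} + \sum_{k=1}^{K} w_{s,k}\,\mathbf{u}_k ,

where \bar{\mu} is the mean-voice supervector, the columns \mathbf{u}_k of \mathbf{U} are eigenvoices estimated from training speakers, and \mathbf{w}_s is the low-dimensional adaptation weight vector for source s. Under this view, the algorithm described in the abstract must infer one weight vector per talker (\mathbf{w}_1 and \mathbf{w}_2) from the single-channel mixture alone.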
