Variational Probabilistic Speech Separation Using Microphone Arrays

Separating multiple speech sources from a limited number of noisy sensor measurements is a difficult problem of great practical interest. Although previously introduced source separation methods, such as independent component analysis (ICA), can be made to work in many situations, most of them fail when the sensors are very noisy or when the number of sources exceeds the number of sensors. Our approach is to combine the multiple sensor likelihoods, obtained using time-delay-of-arrival (TDOA) information, with a generative probability model of the sources. This model accounts for the power spectrum of each source using a mixture model, and for the phase of each source using one discretized hidden phase variable per frequency. Source separation is achieved by identifying the source vector configuration of maximum a posteriori (MAP) probability given all available information. An exhaustive search for the MAP configuration is computationally intractable, but we present an efficient variational technique that performs approximate probabilistic inference. For the problem of separating delayed speech mixtures corrupted by additive noise, the algorithm improves on the signal-to-noise ratio (SNR) gain of existing state-of-the-art probabilistic and TDOA-based speech separation algorithms by over 10 dB. This improvement is obtained by combining the information used by these approaches under a representative probabilistic description of the speech production and mixing process. The method is capable of recovering high-fidelity estimates of the underlying speech sources even when there are more sources than microphone observations.
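
As a rough illustration of the sensor likelihood mentioned above, the sketch below evaluates how well a candidate set of source spectra explains the microphone observations under a delayed, additive-noise mixing model. It is not the paper's implementation: the function names, the per-bin frequency-domain formulation, and the circular complex Gaussian noise model with a single variance `noise_var` are all assumptions made for illustration.

```python
import cmath
import math


def mix_delayed(sources, delays, freqs):
    """Noiseless microphone spectra for a delay-only mixing model (assumed).

    sources[s][k]: complex spectrum of source s at frequency bin k.
    delays[m][s]:  propagation delay in seconds from source s to microphone m.
    freqs[k]:      angular frequency (rad/s) of bin k.
    Returns x[m][k], the sum over sources of the delayed source spectra.
    """
    return [
        [
            sum(
                sources[s][k] * cmath.exp(-1j * freqs[k] * delays[m][s])
                for s in range(len(sources))
            )
            for k in range(len(freqs))
        ]
        for m in range(len(delays))
    ]


def sensor_log_likelihood(observed, sources, delays, freqs, noise_var):
    """Log-likelihood of the observed microphone spectra given candidate sources.

    Assumes independent circular complex Gaussian sensor noise with variance
    noise_var in every bin (an illustrative assumption, not from the paper).
    """
    predicted = mix_delayed(sources, delays, freqs)
    log_lik = 0.0
    for obs_mic, pred_mic in zip(observed, predicted):
        for obs, pred in zip(obs_mic, pred_mic):
            log_lik += -abs(obs - pred) ** 2 / noise_var
            log_lik -= math.log(math.pi * noise_var)
    return log_lik


# Hypothetical toy setup: two sources, two microphones, two frequency bins.
freqs = [2 * math.pi * 100.0, 2 * math.pi * 200.0]
true_sources = [[1 + 0j, 0.5j], [0.3 + 0j, -1 + 0j]]
delays = [[0.0, 0.001], [0.001, 0.0]]
observed = mix_delayed(true_sources, delays, freqs)

silent_sources = [[0j, 0j], [0j, 0j]]
ll_true = sensor_log_likelihood(observed, true_sources, delays, freqs, 0.1)
ll_silent = sensor_log_likelihood(observed, silent_sources, delays, freqs, 0.1)
```

In this toy case the true source spectra score higher than an all-zero candidate. A MAP search would add the log-prior from the source model to this term, and enumerating every discretized phase and mixture-component combination grows exponentially with the number of sources and frequencies, which is why an approximate inference technique is needed.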
