Monaural speech segregation based on fusion of source-driven with model-driven techniques

In this paper, by exploiting prevalent methods from speech coding and synthesis, a new single-channel speech segregation technique is presented. The technique integrates a model-driven method with a source-driven method to exploit the advantages of both approaches while significantly reducing their individual drawbacks. We apply harmonic modelling, in which pitch and spectral envelope are the main components of the analysis and synthesis stages. The pitch values of the two speakers are obtained with a source-driven method. The spectral envelope is obtained with a new model-driven technique consisting of four components: a trained codebook of vector-quantized envelopes (VQ-based separation), a mixture-maximum (MIXMAX) approximation, a minimum mean square error (MMSE) estimator, and a harmonic synthesizer. In contrast to previous model-driven techniques, this approach is speaker-independent and can separate out unvoiced regions as well as suppress the crosstalk effect, both of which are drawbacks of source-driven, or equivalently computational auditory scene analysis (CASA), models. We compare our fused model with both model-driven and source-driven techniques in subjective and objective experiments. The results show that although model-based separation delivers the best quality in the speaker-dependent case, the integrated model outperforms the individual approaches in the speaker-independent scenario. This result supports the idea that the human auditory system relies on both grouping cues (e.g., pitch tracking) and a priori knowledge (e.g., trained quantized envelopes) to segregate speech signals.
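The harmonic synthesis stage described above can be sketched as follows: given a frame's pitch and spectral envelope, the synthesizer sums sinusoids at integer multiples of the pitch, with each harmonic's amplitude read off the envelope at that harmonic's frequency. This is a minimal, hypothetical illustration, not the paper's implementation; the function name, sampling rate, frame length, and linear-interpolation scheme are all assumptions.

```python
import numpy as np

def synthesize_harmonic_frame(f0, env_freqs, env_amps, fs=8000, n=256):
    """Synthesize one voiced frame from a pitch value and a spectral envelope.

    f0        : fundamental frequency of the speaker (Hz)
    env_freqs : frequency grid (Hz) on which the envelope is defined
    env_amps  : envelope magnitudes on that grid (e.g., a VQ codebook entry)
    """
    t = np.arange(n) / fs
    frame = np.zeros(n)
    k = 1
    # Sum cosines at harmonics k*f0 up to the Nyquist frequency; each
    # harmonic's amplitude is the envelope linearly interpolated at k*f0.
    while k * f0 < fs / 2:
        amp = np.interp(k * f0, env_freqs, env_amps)
        frame += amp * np.cos(2 * np.pi * k * f0 * t)
        k += 1
    return frame
```

In a full separation system, one such frame would be synthesized per speaker, per analysis window, using that speaker's estimated pitch track and the envelope selected by the VQ/MIXMAX/MMSE stage, and the frames would then be overlap-added.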
