Monaural speech separation based on MAXVQ and CASA for robust speech recognition

Robustness is one of the most important topics for automatic speech recognition (ASR) in practical applications. Monaural speech separation based on computational auditory scene analysis (CASA) offers a solution to this problem. In this paper, a novel system is presented to separate the monaural speech of two talkers. Gaussian mixture models (GMMs) and vector quantizers (VQs) are used to learn the grouping cues from isolated clean data for each speaker. Given a mixed utterance, speaker identification is first performed to identify the two speakers present in it; the factorial-max vector quantization model (MAXVQ) is then used to infer the mask signals, and finally the utterance of the target speaker is resynthesized within the CASA framework. Recognition results on the 2006 speech separation challenge corpus demonstrate that the proposed system significantly improves the robustness of ASR.
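The mask-inference step described above can be illustrated with a minimal sketch of the factorial-max idea: under the log-max assumption, the log-spectrum of a two-talker mixture in each frequency channel is approximately the maximum of the two sources' log-spectra, so the best pair of VQ codewords (one per speaker) explains the mixture, and the binary mask assigns each channel to whichever codeword dominates. The function below is a hypothetical simplification (codebook shapes, per-frame processing, and squared-error scoring are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def maxvq_mask(mix_logspec, codebook_a, codebook_b):
    """Estimate a binary mask for speaker A via the factorial-max
    (log-max) approximation for a single frame.

    mix_logspec: (F,) log-spectrum of the two-talker mixture
    codebook_a:  (Ka, F) VQ codebook of clean log-spectra, speaker A
    codebook_b:  (Kb, F) VQ codebook of clean log-spectra, speaker B
    Returns an (F,) boolean mask: True where speaker A dominates.
    """
    # Predicted mixture for every codeword pair: elementwise max
    # of the two clean codewords, shape (Ka, Kb, F).
    pred = np.maximum(codebook_a[:, None, :], codebook_b[None, :, :])
    # Squared-error fit of each predicted mixture to the observation.
    err = np.sum((pred - mix_logspec) ** 2, axis=-1)
    # Pick the codeword pair that best explains the mixture.
    ia, ib = np.unravel_index(np.argmin(err), err.shape)
    # Channels where speaker A's codeword dominates are assigned to A.
    return codebook_a[ia] >= codebook_b[ib]
```

In the full system, a mask like this would gate the time-frequency units passed to CASA resynthesis of the target speaker; an exhaustive pair search is shown here, whereas larger codebooks would call for pruning.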
