A Supervised Learning Approach to Monaural Segregation of Reverberant Speech

A major source of signal degradation in real environments is room reverberation. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. This paper proposes a supervised learning approach to monaural segregation of reverberant voiced speech, which learns a mapping from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given the observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. Models trained using this objective function yield significantly better T-F unit labeling. A segmentation and grouping framework is used to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces promising results under various reverberant conditions and generalizes well to new utterances and new speakers.
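To make the labeling idea concrete, the following minimal sketch shows how a posterior probability per T-F unit could be thresholded into a binary mask and scored against an ideal binary mask with an energy-weighted, SNR-style measure. It is an illustration under stated assumptions, not the paper's objective function or system: the function names, the 0.5 threshold, the toy energy values, and the particular score (ideal-mask energy over the energy of mislabeled units) are all hypothetical choices made for this example.

```python
import math

def ideal_binary_mask(target_energy, interference_energy):
    """Label a T-F unit 1 when target energy exceeds interference energy."""
    return [[1 if t > n else 0 for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target_energy, interference_energy)]

def mask_from_posterior(posterior, threshold=0.5):
    """Label a unit target dominant when its posterior exceeds the threshold.
    The 0.5 threshold is an assumption made for this sketch."""
    return [[1 if p > threshold else 0 for p in row] for row in posterior]

def mask_snr_db(estimated_mask, ideal_mask, mixture_energy):
    """Energy-weighted SNR-style score (hypothetical, for illustration).
    Signal: mixture energy in units the ideal mask keeps.
    Noise: mixture energy in units where the two masks disagree."""
    signal = 0.0
    noise = 0.0
    for est_row, ideal_row, e_row in zip(estimated_mask, ideal_mask, mixture_energy):
        for est, ideal, energy in zip(est_row, ideal_row, e_row):
            signal += ideal * energy
            noise += (est - ideal) ** 2 * energy
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)

# Toy 2x2 grid of T-F units (rows: frequency channels, columns: time frames).
target = [[4.0, 1.0], [0.5, 3.0]]
interference = [[1.0, 2.0], [1.0, 1.0]]
# Assumes target and interference are uncorrelated, so their energies add.
mixture = [[t + n for t, n in zip(tr, nr)] for tr, nr in zip(target, interference)]
posterior = [[0.9, 0.6], [0.2, 0.8]]  # made-up classifier outputs

ibm = ideal_binary_mask(target, interference)
estimated = mask_from_posterior(posterior)
score_db = mask_snr_db(estimated, ibm, mixture)
```

Because each mislabeled unit is weighted by its energy, an error in a high-energy unit lowers the score more than an error in a low-energy one, which is the intuition behind tying a training objective to SNR rather than to the plain count of labeling errors.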
