Monaural segregation of voiced speech using discriminative random fields

Techniques for separating speech from background noise and other sources of interference have important applications in robust speech recognition and speech enhancement. Many traditional computational auditory scene analysis (CASA) approaches decompose the input mixture into a time-frequency (T-F) representation and attempt to identify the T-F units in which the target energy dominates that of the interference. This is accomplished using a two-stage process of segmentation and grouping. In this pilot study, we explore the use of Discriminative Random Fields (DRFs) for monaural speech segregation. We find that DRFs allow us to combine multiple auditory features effectively while integrating the two CASA stages into one. Our preliminary results suggest that CASA-based approaches may benefit from the DRF framework.
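The T-F labeling goal described above is essentially the ideal binary mask: a unit is labeled 1 where the target energy exceeds that of the interference. The sketch below is a minimal illustration of that idea, not the DRF system itself; the spectrogram() helper, the frame parameters, and the 0 dB local SNR criterion are illustrative assumptions, and a real CASA front end would typically use a gammatone filterbank and would have to estimate the mask from the mixture alone.

```python
# Minimal illustrative sketch (not the system described above): an "ideal"
# binary mask labels a T-F unit 1 where target energy dominates interference
# energy. The STFT-based spectrogram() helper stands in for a gammatone
# filterbank front end, and the 0 dB local criterion is an assumption.
import numpy as np

def spectrogram(signal, frame_len=320, hop=160):
    """Magnitude-squared STFT used here as a simple T-F energy representation."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape: (time, frequency)

def ideal_binary_mask(target, interference, criterion_db=0.0):
    """Label each T-F unit 1 when target energy exceeds interference by the criterion."""
    t_energy = spectrogram(target)
    i_energy = spectrogram(interference)
    local_snr = 10.0 * np.log10((t_energy + 1e-12) / (i_energy + 1e-12))
    return (local_snr > criterion_db).astype(np.uint8)
```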
