Acoustic Features for Classification Based Speech Separation

Speech separation can be effectively formulated as a binary classification problem. A classification-based system estimates a binary mask from acoustic features extracted in each time-frequency unit. So far, only pitch and the amplitude modulation spectrogram have been used as unit-level features. In this paper, we study other acoustic features and show that they can significantly improve separation performance for both voiced and unvoiced speech. To further explore the complementarity of these features in terms of discriminative power, we propose a group Lasso approach for feature combination. The final combined feature set yields promising results in both matched and unmatched test conditions.
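The idea behind group Lasso feature combination can be illustrated with a minimal sketch. The abstract does not describe the optimization procedure, so the code below is not the authors' implementation: it shows only the standard block soft-thresholding step (the proximal operator of the group Lasso penalty), which shrinks all weights of a feature group together and sets an entire group to zero when its weight norm is small. The group names, index layout, weights, and threshold are hypothetical values chosen for illustration, and the scaling of the penalty by the square root of the group size is an assumed convention.

```python
import math


def group_soft_threshold(weights, groups, threshold):
    """Apply block soft-thresholding to grouped classifier weights.

    weights   -- list of floats, one weight per feature dimension
    groups    -- dict mapping a feature-group name to its list of indices
    threshold -- step size times regularization strength (hypothetical)
    """
    result = list(weights)
    for indices in groups.values():
        norm = math.sqrt(sum(weights[i] ** 2 for i in indices))
        # Assumed convention: scale the cutoff by sqrt(group size) so that
        # larger feature groups are not penalized less per dimension.
        cutoff = threshold * math.sqrt(len(indices))
        # A group whose norm falls below the cutoff is removed entirely;
        # otherwise every weight in the group is shrunk by the same factor.
        scale = 0.0 if norm <= cutoff else 1.0 - cutoff / norm
        for i in indices:
            result[i] = weights[i] * scale
    return result


def selected_groups(weights, groups):
    """Return the names of groups that keep at least one nonzero weight."""
    return sorted(
        name
        for name, indices in groups.items()
        if any(weights[i] != 0.0 for i in indices)
    )


# Hypothetical layout: two dimensions per feature type.
groups = {"pitch": [0, 1], "ams": [2, 3]}
weights = [3.0, 4.0, 0.1, 0.1]
shrunk = group_soft_threshold(weights, groups, threshold=1.0)
kept = selected_groups(shrunk, groups)
```

In this toy example, the weakly weighted group is dropped as a whole while the strongly weighted group is kept and shrunk. That group-level sparsity is what makes the penalty suitable for choosing among whole feature types rather than individual dimensions.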
