Unsupervised single-channel speech separation via deep neural network for different gender mixtures

In this study, we propose a regression approach via a deep neural network (DNN) for unsupervised speech separation in a single-channel setting. We rely on the key assumption that two speakers can be well segregated if they are not too similar to each other, and we propose a dissimilarity measure between two speakers to characterize how well competing speakers can be separated. We demonstrate that the distance between speakers of different genders is large enough to make separation feasible. We finally propose a DNN architecture with dual outputs, one representing the female speaker group and the other characterizing the male speaker group. Trained and tested on the Speech Separation Challenge corpus, the proposed DNN approach achieves large performance gains over state-of-the-art unsupervised techniques without using specific knowledge about the mixed target and interfering speakers, and it even outperforms a supervised GMM-based method.
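For concreteness, the sketch below shows one way such a dual-output regression DNN could be organized: a shared trunk maps a context window of mixture features to two regression heads, one per gender group, trained with a joint mean-squared-error objective. The feature choice (log-power spectra with context frames), layer sizes, and the class name DualOutputSeparator are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a dual-output regression DNN for two-speaker separation.
# Feature type, dimensions, and loss are assumptions for illustration only.
import torch
import torch.nn as nn

class DualOutputSeparator(nn.Module):
    def __init__(self, n_bins=257, n_frames=7, hidden=2048):
        super().__init__()
        # Shared trunk over a context window of mixture log-power spectra.
        self.trunk = nn.Sequential(
            nn.Linear(n_bins * n_frames, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two regression heads: one for the female speaker group,
        # one for the male speaker group.
        self.female_head = nn.Linear(hidden, n_bins)
        self.male_head = nn.Linear(hidden, n_bins)

    def forward(self, x):
        h = self.trunk(x)
        return self.female_head(h), self.male_head(h)

def separation_loss(pred_f, pred_m, target_f, target_m):
    # Joint MSE over both outputs (an assumed training objective).
    return nn.functional.mse_loss(pred_f, target_f) + \
           nn.functional.mse_loss(pred_m, target_m)
```

In use, each estimated spectrum would be converted back to a waveform with the mixture phase; that reconstruction step is omitted here since the paper's exact feature pipeline is not specified in the abstract.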
