Single Channel Speech Source Separation Using Hierarchical Deep Neural Networks

Single-channel speech source separation is a well-known task for preparing speech signals for applications such as speech recognition and enhancement. In this paper, we introduce a novel design that separates sources using hierarchical deep neural networks and time-frequency masks. In the first hierarchy level, the proposed method classifies mixture signals into three categories according to the genders of the mixed speakers. Three further networks, one per mixture type, then use the categorized data for speech separation. Finally, an enhancement stage improves the quality of the separated voices using an improved cost function that reduces the interference from the sources estimated in the previous stage. The data are drawn from the TSP corpus, and the system outputs are evaluated with several metrics, including signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and perceptual evaluation of speech quality (PESQ). Compared with other methods, the proposed architecture performs considerably better.
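To make the masking step concrete, the sketch below shows how a soft time-frequency (ratio) mask separates a two-speaker mixture in the STFT domain: one network's predicted mask is multiplied element-wise with the mixture spectrogram, and its complement yields the second source. This is a minimal illustration of the general technique, not the paper's exact network or cost function; the array shapes and the `apply_tf_mask` helper are assumptions for the example.

```python
import numpy as np

def apply_tf_mask(mixture_stft, mask_s1):
    """Separate two sources from a mixture STFT with a soft ratio mask.

    mixture_stft : complex array (freq_bins, frames), STFT of the mixture
    mask_s1      : real array in [0, 1], same shape, mask for source 1
    Returns the two estimated source STFTs; their complements sum to the
    mixture, so no energy is lost or invented by the masking itself.
    """
    s1 = mask_s1 * mixture_stft          # keep the bins dominated by source 1
    s2 = (1.0 - mask_s1) * mixture_stft  # remainder is assigned to source 2
    return s1, s2

# Toy mixture: 2 frequency bins x 3 frames
X = np.array([[1 + 1j, 2 + 0j, 0 + 2j],
              [3 + 0j, 0 + 1j, 1 + 1j]])
# A hypothetical mask (in practice, predicted by the separation DNN)
M = np.array([[0.8, 0.5, 0.2],
              [1.0, 0.0, 0.5]])

s1, s2 = apply_tf_mask(X, M)
# The two estimates reconstruct the mixture exactly
print(np.allclose(s1 + s2, X))
```

In practice the masked STFTs are inverted back to waveforms (e.g. with an inverse STFT using the mixture's phase), and a binary mask is obtained as the special case where each entry of `M` is 0 or 1.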
