Speaker Extraction Using Stacked BLSTM Optimized with Frequency-domain Differentiated Spectrum Loss

We propose a novel speaker extraction method that extracts the target speaker's speech from a mixture of the target speaker and an unknown interferer in a semi-supervised scheme. The dynamic nature of the speech signal motivates the use of Recurrent Neural Networks (RNNs). In this work, we use two stacked layers of Bidirectional Long Short-Term Memory (BLSTM) to account for the continuity between consecutive speech frames. Moreover, to preserve the spectral changes of the target speaker in the frequency domain, we propose a loss function that takes the difference between neighboring frequency bins within a frame. The overall loss is a weighted sum of this frequency-differentiated term and the Mean Squared Error (MSE). Evaluation results on the Speech Separation Challenge (SSC) dataset show that the proposed approach outperforms the baseline method, which uses a Deep Neural Network (DNN), in terms of three evaluation metrics: Signal to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI). Furthermore, by comparing these metrics across two otherwise identical networks, one optimized with the proposed loss function and the other with MSE, we show that the new loss function yields better results for speaker extraction applications.
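
The sketch below illustrates how the two pieces described in the abstract could fit together: two stacked BLSTM layers that estimate a magnitude mask, and a loss of the form L = alpha * L_MSE + (1 - alpha) * L_diff, where L_diff penalizes mismatches in the differences between neighboring frequency bins. This is a minimal, hedged reconstruction, not the authors' code: the class name SpeakerExtractor, the hidden size, the sigmoid mask, the squared-error form of the difference term, and the weight alpha are all illustrative assumptions; the abstract only states that the loss is a weighted sum of a frequency-differentiated term and MSE.

```python
import torch
import torch.nn as nn


class SpeakerExtractor(nn.Module):
    """Two stacked BLSTM layers followed by a mask-estimation layer.

    The hidden size and the sigmoid mask activation are illustrative
    assumptions; the abstract does not specify them.
    """

    def __init__(self, n_freq_bins: int = 257, hidden: int = 512):
        super().__init__()
        # num_layers=2 with bidirectional=True gives two stacked BLSTM layers.
        self.blstm = nn.LSTM(input_size=n_freq_bins, hidden_size=hidden,
                             num_layers=2, bidirectional=True, batch_first=True)
        self.mask = nn.Linear(2 * hidden, n_freq_bins)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, freq_bins) mixture magnitude spectrogram.
        h, _ = self.blstm(mix_mag)
        # Masking the mixture yields the estimated target magnitude.
        return torch.sigmoid(self.mask(h)) * mix_mag


def differentiated_spectrum_loss(est: torch.Tensor, ref: torch.Tensor,
                                 alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of MSE and a frequency-differentiated term.

    The difference term matches the change between neighboring frequency
    bins in each frame of the estimate to that of the reference. The
    value of alpha and the squared-error form are assumptions.
    """
    mse = torch.mean((est - ref) ** 2)
    # Differences between neighboring frequency bins (last dimension).
    d_est = est[..., 1:] - est[..., :-1]
    d_ref = ref[..., 1:] - ref[..., :-1]
    diff = torch.mean((d_est - d_ref) ** 2)
    return alpha * mse + (1.0 - alpha) * diff


# Illustrative usage with dummy spectrogram batches.
model = SpeakerExtractor()
mix = torch.rand(8, 100, 257)      # (batch, frames, freq_bins)
target = torch.rand(8, 100, 257)
loss = differentiated_spectrum_loss(model(mix), target)
```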
