Single-Channel Multi-Speaker Speech Separation Based on Quantized Ratio Mask and Residual Network

The recently proposed deep clustering-based algorithms represent a fundamental advance on the single-channel multi-speaker speech separation problem. These methods use the ideal binary mask (IBM) to construct the objective function and K-means clustering to estimate the IBM. However, when the sources belong to the same class or the number of sources is large, the assumption that each time-frequency unit of the mixture is dominated by only one source becomes weak, and IBM-based separation causes spectral holes or aliasing. In this work we instead propose a quantized ideal ratio mask: the ideal ratio mask is quantized so that the output of the neural network takes only a limited number of possible values. The quantized ideal ratio mask is then used to construct the objective function for the case in which multiple sources dominate a time-frequency unit, which improves network performance. Furthermore, we use a network framework that combines a residual network, a recurrent network, and a fully connected network to exploit correlation information across frequency. We evaluated our system on the TIMIT dataset and show a 1.6 dB SDR improvement over previous state-of-the-art methods.
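To make the quantization idea concrete, the sketch below (not the authors' code) shows one way an ideal ratio mask can be quantized to a fixed set of levels, turning mask estimation into a classification target with a limited number of possible output values. The number of levels, the uniform spacing, and the use of magnitude spectrograms are assumptions for illustration only.

```python
# Minimal sketch of a quantized ideal ratio mask (QIRM) target, assuming
# magnitude spectrograms and uniformly spaced quantization levels.
import numpy as np

def ideal_ratio_mask(source_mag, all_sources_mag, eps=1e-8):
    """IRM for one source: |S_i| / sum_j |S_j|, per time-frequency unit."""
    return source_mag / (sum(all_sources_mag) + eps)

def quantize_mask(mask, num_levels=5):
    """Map each mask value in [0, 1] to the nearest of `num_levels`
    uniformly spaced values; return the level index (a classification
    target) and the quantized mask itself."""
    levels = np.linspace(0.0, 1.0, num_levels)         # e.g. 0, 0.25, ..., 1
    idx = np.abs(mask[..., None] - levels).argmin(-1)  # nearest-level index
    return idx, levels[idx]

# Toy example: a two-speaker mixture with random magnitude spectrograms.
rng = np.random.default_rng(0)
s1 = rng.random((129, 100))   # freq bins x time frames
s2 = rng.random((129, 100))
irm = ideal_ratio_mask(s1, [s1, s2])
targets, qirm = quantize_mask(irm, num_levels=5)
print(targets.shape, np.unique(qirm))
```

With such a target, the separation network can be trained with a cross-entropy-style objective over the level indices instead of regressing a continuous mask, which is one plausible reading of how the limited set of output values is used in the objective function.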
