A Pitch-aware Approach to Single-channel Speech Separation

Despite significant advances in deep learning for separating speech sources mixed in a single channel, same-gender mixtures (male-male or female-female) remain more difficult to separate than opposite-gender mixtures. In this study, we propose a pitch-aware approach to improve speech separation performance. The proposed approach performs separation in three steps: 1) training a pre-separation model to separate the mixed sources; 2) training a pitch-tracking network to perform polyphonic pitch tracking; 3) incorporating the estimated pitch into the final pitch-aware speech separation. Experimental results on the public WSJ0-2mix dataset show that the new approach improves separation performance for both same- and opposite-gender mixtures. The resulting signal-to-distortion ratio (SDR) of 12.0 dB is the best reported result obtained without any phase enhancement.
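The three-step pipeline above can be sketched in code. The sketch below is purely illustrative: the pre-separation network, the polyphonic pitch tracker, and the pitch-aware separator are replaced by hypothetical stand-in functions (complementary random masks, a per-frame dominant-bin picker, and Gaussian masks centered on the estimated pitch bins); none of these stand-ins reflect the paper's actual models.

```python
import numpy as np

def pre_separate(mix_spec):
    """Step 1 stand-in: split the mixture spectrogram with two
    complementary random masks. A real system would use a trained
    pre-separation network instead."""
    rng = np.random.default_rng(0)
    mask = rng.uniform(0.0, 1.0, size=mix_spec.shape)
    return mix_spec * mask, mix_spec * (1.0 - mask)

def track_pitch(source_spec):
    """Step 2 stand-in: take the dominant frequency bin per frame as a
    crude F0 proxy. A real system would use a trained polyphonic
    pitch-tracking network."""
    return source_spec.argmax(axis=0)  # one pitch-bin index per frame

def pitch_aware_separate(mix_spec, pitch_a, pitch_b):
    """Step 3 stand-in: final separation conditioned on the estimated
    pitch tracks, here via Gaussian masks around each pitch bin."""
    n_bins, _ = mix_spec.shape
    bins = np.arange(n_bins)[:, None]
    mask_a = np.exp(-0.5 * ((bins - pitch_a[None, :]) / 2.0) ** 2)
    mask_b = np.exp(-0.5 * ((bins - pitch_b[None, :]) / 2.0) ** 2)
    total = mask_a + mask_b + 1e-8  # avoid division by zero
    return mix_spec * mask_a / total, mix_spec * mask_b / total

# Toy magnitude spectrogram: 129 frequency bins x 50 frames.
mix = np.abs(np.random.default_rng(1).standard_normal((129, 50)))
est_a, est_b = pre_separate(mix)                    # step 1
f0_a, f0_b = track_pitch(est_a), track_pitch(est_b)  # step 2
out_a, out_b = pitch_aware_separate(mix, f0_a, f0_b)  # step 3
print(out_a.shape, out_b.shape)
```

The point of the sketch is the data flow, not the models: pitch estimates produced on the pre-separated sources condition the final separation stage.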
