Wavesplit: End-to-End Speech Separation by Speaker Clustering

We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the model infers a representation for each source and then estimates each source signal given the inferred representations. The model is trained to perform both tasks jointly from the raw waveform. Wavesplit infers the set of source representations via clustering, which addresses the fundamental permutation problem of separation. For speech separation, our sequence-wide speaker representations provide more robust separation of long, challenging recordings compared to prior work. Wavesplit sets a new state of the art on clean mixtures of 2 or 3 speakers (WSJ0-2/3mix), as well as in noisy and reverberated settings (WHAM/WHAMR). We also set a new benchmark on the recent LibriMix dataset. Finally, we show that Wavesplit is applicable beyond speech by separating fetal and maternal heart rates from a single abdominal electrocardiogram.
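
To make the clustering idea concrete, below is a minimal sketch of how per-frame speaker vectors can be aggregated into sequence-wide, order-free speaker representations. The network itself is not reproduced: `frame_embeddings` stands in for the output of a hypothetical speaker stack, and plain k-means is used as a generic clustering choice; all names and shapes are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: cluster per-frame speaker embeddings into one centroid per speaker.
# Conditioning separation on this unordered *set* of centroids, rather than on
# ordered per-frame outputs, is what sidesteps the permutation problem.
import numpy as np

def sequence_speaker_centroids(frame_embeddings, n_speakers, n_iters=20, seed=0):
    """Cluster T per-frame embeddings (T x D) into n_speakers centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames (fancy indexing copies).
    init = rng.choice(len(frame_embeddings), n_speakers, replace=False)
    centroids = frame_embeddings[init]
    for _ in range(n_iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(
            frame_embeddings[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned frames.
        for k in range(n_speakers):
            members = frame_embeddings[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids  # (n_speakers, D): sequence-wide speaker summaries

# Toy usage: 200 frames of 16-dim embeddings drawn around two speaker modes.
rng = np.random.default_rng(1)
frames = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=(100, 16)),  # "speaker A" frames
    rng.normal(loc=1.0, scale=0.1, size=(100, 16)),  # "speaker B" frames
])
centroids = sequence_speaker_centroids(frames, n_speakers=2)
print(centroids.shape)  # (2, 16): one vector per inferred source
```

Because the centroids summarize the whole sequence, a downstream separation stack conditioned on them sees a stable identity for each source across long recordings, rather than having to keep per-frame output channels consistently ordered.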
