Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

We propose FedEnhance, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID data distributed across multiple clients. We simulate a real-world scenario in which each client has access only to a few noisy recordings from a small, disjoint set of speakers (hence non-IID). Each client trains its model in isolation using mixture invariant training while periodically sending updates to a central server. Our experiments show that this approach achieves enhancement performance competitive with IID training on a single device, and that server-side transfer learning further improves both convergence speed and overall performance. Moreover, we show that updates from clients trained locally with supervised and unsupervised losses can be combined effectively. We also release a new dataset, LibriFSD50K, together with its creation recipe, to facilitate FL research on source separation problems.
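The unsupervised objective each client optimizes is mixture invariant training (MixIT): the model separates a mixture of two mixtures, and the loss searches over all binary assignments of the estimated sources back to the two input mixtures. A minimal NumPy sketch of this assignment search is below; the negative-SNR reconstruction loss and the function name `mixit_loss` are illustrative assumptions, not the paper's exact implementation.

```python
import itertools
import numpy as np

def mixit_loss(estimates, mix1, mix2):
    """MixIT-style loss: assign each estimated source to one of the two
    input mixtures and keep the assignment with the lowest loss.

    estimates: (M, T) array of M estimated sources, separated from the
               mixture of mixtures mix1 + mix2.
    """
    def neg_snr(ref, est, eps=1e-8):
        # Negative signal-to-noise ratio in dB (lower is better).
        return -10.0 * np.log10(
            np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

    M = estimates.shape[0]
    best = np.inf
    # Exhaustively try every binary assignment of the M estimates.
    for bits in itertools.product([0, 1], repeat=M):
        est1 = sum(estimates[i] for i in range(M) if bits[i] == 0)
        est2 = sum(estimates[i] for i in range(M) if bits[i] == 1)
        best = min(best, neg_snr(mix1, est1) + neg_snr(mix2, est2))
    return best

# Sanity check: if the estimates are exactly the two underlying mixtures,
# the optimal assignment reconstructs each one perfectly.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(100)
s2 = rng.standard_normal(100)
loss = mixit_loss(np.stack([s1, s2]), s1, s2)
```

Because no ground-truth isolated sources are needed, each client can compute this loss on its own noisy recordings alone, which is what makes the training unsupervised.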

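On the server side, the periodic client updates can be combined with standard FedAvg-style weighted averaging. The sketch below shows that aggregation step under simple assumptions (parameters as NumPy arrays, weighting by local dataset size); the function name `federated_average` is illustrative, not an identifier from the paper.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average client model parameters, weighted by local dataset size.

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        averaged[name] = sum(
            (n / total) * w[name]
            for w, n in zip(client_weights, client_sizes)
        )
    return averaged

# Two clients that trained locally (e.g., with a MixIT loss) send their
# parameters to the server; the server averages them by data size.
clients = [
    {"conv.w": np.ones((2, 2))},        # client 0: 100 local examples
    {"conv.w": 3.0 * np.ones((2, 2))},  # client 1: 300 local examples
]
global_model = federated_average(clients, [100, 300])
print(global_model["conv.w"][0, 0])  # 0.25 * 1 + 0.75 * 3 = 2.5
```

Because only parameter updates leave each device, the raw noisy recordings, and hence the speaker identities that make the data non-IID, never need to be shared.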