Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

We propose FedEnhance, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID data distributed across multiple clients. We simulate a real-world scenario in which each client has access only to a few noisy recordings from a small, disjoint set of speakers (hence non-IID). Each client trains its model in isolation using mixture invariant training while periodically sending updates to a central server. Our experiments show that this approach achieves enhancement performance competitive with IID training on a single device, and that server-side transfer learning further improves both convergence speed and overall performance. Moreover, we show that updates from clients trained locally with supervised and unsupervised losses can be combined effectively. We also release a new dataset, LibriFSD50K, together with its creation recipe, to facilitate FL research on source separation problems.
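The unsupervised objective each client optimizes is mixture invariant training (MixIT): the model separates a mixture of two mixtures, and the loss searches over all binary assignments of the estimated sources back to the two input mixtures. A minimal NumPy sketch of this assignment search is below; the negative-SNR reconstruction loss and the function name `mixit_loss` are illustrative assumptions, not the paper's exact implementation.

```python
import itertools
import numpy as np

def mixit_loss(estimates, mix1, mix2):
    """MixIT-style loss: assign each estimated source to one of the two
    input mixtures and keep the assignment with the lowest loss.

    estimates: (M, T) array of M estimated sources, separated from the
               mixture of mixtures mix1 + mix2.
    """
    def neg_snr(ref, est, eps=1e-8):
        # Negative signal-to-noise ratio in dB (lower is better).
        return -10.0 * np.log10(
            np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + eps) + eps)

    M = estimates.shape[0]
    best = np.inf
    # Exhaustively try every binary assignment of the M estimates.
    for bits in itertools.product([0, 1], repeat=M):
        est1 = sum(estimates[i] for i in range(M) if bits[i] == 0)
        est2 = sum(estimates[i] for i in range(M) if bits[i] == 1)
        best = min(best, neg_snr(mix1, est1) + neg_snr(mix2, est2))
    return best

# Sanity check: if the estimates are exactly the two underlying mixtures,
# the optimal assignment reconstructs each one perfectly.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(100)
s2 = rng.standard_normal(100)
loss = mixit_loss(np.stack([s1, s2]), s1, s2)
```

Because no ground-truth isolated sources are needed, each client can compute this loss on its own noisy recordings alone, which is what makes the training unsupervised.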

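On the server side, the periodic client updates can be combined with standard FedAvg-style weighted averaging. The sketch below shows that aggregation step under simple assumptions (parameters as NumPy arrays, weighting by local dataset size); the function name `federated_average` is illustrative, not an identifier from the paper.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average client model parameters, weighted by local dataset size.

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        averaged[name] = sum(
            (n / total) * w[name]
            for w, n in zip(client_weights, client_sizes)
        )
    return averaged

# Two clients that trained locally (e.g., with a MixIT loss) send their
# parameters to the server; the server averages them by data size.
clients = [
    {"conv.w": np.ones((2, 2))},        # client 0: 100 local examples
    {"conv.w": 3.0 * np.ones((2, 2))},  # client 1: 300 local examples
]
global_model = federated_average(clients, [100, 300])
print(global_model["conv.w"][0, 0])  # 0.25 * 1 + 0.75 * 3 = 2.5
```

Because only parameter updates leave each device, the raw noisy recordings, and hence the speaker identities that make the data non-IID, never need to be shared.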