Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT

Despite progress in music source separation research, the small amount of publicly available clean source data remains a constant limiting factor for performance. Recent advances in self-supervised learning therefore present a largely unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model. We first investigate the potential impact of the original HuBERT model by inserting an adapted version of it into the well-known Demucs V2 time-domain separation architecture. We then propose Pac-HuBERT, a time-frequency-domain self-supervised model, which we combine with a Res-U-Net decoder for source separation. Pac-HuBERT uses primitive auditory features of music as unsupervised clustering labels to initialize the self-supervised pretraining process on the Free Music Archive (FMA) dataset. The resulting framework achieves better source-to-distortion ratio (SDR) performance on the MUSDB18 test set than the original Demucs V2 and Res-U-Net models. We further demonstrate that it can boost performance with small amounts of supervised data. Ultimately, our proposed framework is an effective solution to the challenge of limited clean source data for music source separation.
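
As a concrete illustration of the pretraining targets, the following is a minimal sketch of how primitive auditory features might be turned into frame-level cluster labels for HuBERT-style masked prediction. It is not the authors' exact pipeline: the feature choice (harmonic/percussive energies via librosa's HPSS), the number of clusters, and the helper name primitive_auditory_labels are all illustrative assumptions.

import numpy as np
import librosa
from sklearn.cluster import KMeans

def primitive_auditory_labels(y, sr=16000, n_clusters=100):
    """Assign each spectrogram frame a discrete pseudo-label derived
    from primitive auditory grouping cues (harmonic vs. percussive).
    Feature design and n_clusters are illustrative, not the paper's."""
    # Split the waveform into harmonic and percussive components.
    harmonic, percussive = librosa.effects.hpss(y)
    # Per-frame log-mel energies of each component serve as the
    # "primitive auditory" feature vector.
    mel_h = librosa.power_to_db(
        librosa.feature.melspectrogram(y=harmonic, sr=sr, n_mels=40))
    mel_p = librosa.power_to_db(
        librosa.feature.melspectrogram(y=percussive, sr=sr, n_mels=40))
    feats = np.concatenate([mel_h, mel_p], axis=0).T  # (n_frames, 80)
    # K-means assignments act as the discrete "hidden units" that the
    # Transformer learns to predict at masked frames during pretraining.
    return KMeans(n_clusters=n_clusters, n_init=4).fit_predict(feats)

In a HuBERT-style training loop, such labels would typically be refreshed after an initial pretraining pass by re-clustering the activations of an intermediate Transformer layer.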
