Vocal Harmony Separation Using Time-Domain Neural Networks

Polyphonic vocal recordings are an inherently challenging source separation task due to the melodic structure of the vocal parts and the similar timbre of their constituents. In this work we utilise a time-domain neural network architecture re-purposed from speech separation research and modify it to separate a cappella mixtures at a high sampling rate. We use four-part (soprano, alto, tenor and bass) a cappella recordings of Bach Chorales and Barbershop Quartets for our experiments. Unlike current deep-learning-based choral separation models, where the training objective is to separate constituent sources based on their class, we train our model using a permutation-invariant objective. With this objective we achieve state-of-the-art results for choral music separation. We introduce a novel method to estimate the harmonic overlap between sung musical notes as a measure of task complexity. We also present an analysis of the impact of randomised mixing, input lengths and filterbank lengths on our task. Our results show a moderate negative correlation between the harmonic overlap of the target sources and source separation performance. We report that training our models with randomly mixed, musically incoherent mixtures drastically reduces vocal harmony separation performance, as random mixing decreases the average harmonic overlap presented during training.
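A permutation-invariant objective scores the model's output sources against the references under the best-matching assignment, rather than tying each output channel to a fixed voice class. A minimal sketch of utterance-level permutation-invariant training (PIT) with a scale-invariant SDR criterion is shown below; the function names and the exhaustive permutation search are illustrative assumptions, not the paper's exact implementation.

```python
import itertools
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR (dB) between one estimated and one reference source."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to get the target component.
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10 * np.log10(np.dot(proj, proj) / (np.dot(noise, noise) + eps))

def pit_loss(estimates, references):
    """Negative mean SI-SDR under the best source permutation (utterance-level PIT)."""
    n = len(references)
    best = -np.inf
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_sdr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        best = max(best, score)
    return -best
```

For four sources the exhaustive search over 4! = 24 permutations is cheap; larger source counts typically replace it with Hungarian matching.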
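The harmonic-overlap idea can be illustrated with a simple proxy: given the fundamental frequencies of two sung notes, count how many harmonics of one note fall within a small pitch tolerance of a harmonic of the other. This sketch (tolerance in cents, harmonic count, and the exact fraction-based definition are all hypothetical choices, not the paper's published measure) shows why intervals like the octave yield high overlap.

```python
import math

def harmonic_overlap(f0_a, f0_b, n_harmonics=20, tol_cents=50):
    """Fraction of the first n harmonics of note A that lie within
    tol_cents of some harmonic of note B (an illustrative proxy)."""
    harms_a = [f0_a * k for k in range(1, n_harmonics + 1)]
    harms_b = [f0_b * k for k in range(1, n_harmonics + 1)]
    hits = 0
    for fa in harms_a:
        # A harmonic "overlaps" if any harmonic of B is within the tolerance.
        if any(abs(1200 * math.log2(fa / fb)) <= tol_cents for fb in harms_b):
            hits += 1
    return hits / n_harmonics
```

Under this proxy, a unison gives complete overlap and an octave gives substantial overlap (every even harmonic of the lower note coincides with a harmonic of the upper), which matches the intuition that closely related intervals are harder to separate.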
