Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as weak labels for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources; the score is used only as a training target and is not needed at inference time. Our model consists of a separator, which outputs a time-frequency mask for each instrument, and a transcriptor, which acts as a critic and provides both temporal and frequency supervision to guide the learning of the separator. We further introduce a harmonic mask constraint as another way of leveraging score information during training, and we propose two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator. Results demonstrate that using score information outperforms temporal weak labels, and that the adversarial structures lead to further improvements in both separation and transcription performance.
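
For concreteness, below is a minimal PyTorch sketch of the training setup the abstract describes: a separator that masks the mixture spectrogram per instrument, and a transcriptor that acts as a critic by transcribing each separated estimate against that instrument's score-derived piano roll. Everything here is illustrative rather than the authors' implementation: the module names, layer sizes, the exact form of the harmonic mask penalty, and the mixture-consistency term are assumptions, and the paper's two adversarial fine-tuning losses are omitted.

```python
# A minimal sketch of score-supervised separation, under stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_INSTRUMENTS = 4   # assumed number of sources per mixture
N_FREQ = 513        # assumed STFT magnitude bins (1024-point FFT)
N_PITCHES = 88      # assumed piano-roll height


class Separator(nn.Module):
    """Predicts one soft time-frequency mask per instrument."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, N_INSTRUMENTS, 3, padding=1),
        )

    def forward(self, mix_mag):                    # (B, 1, F, T)
        return torch.sigmoid(self.net(mix_mag))   # (B, I, F, T), masks in [0, 1]


class Transcriptor(nn.Module):
    """Critic: maps a (masked) spectrogram to frame-level pitch activations."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
        self.proj = nn.Linear(N_FREQ, N_PITCHES)

    def forward(self, spec):                       # (B, 1, F, T)
        h = self.conv(spec).squeeze(1)             # (B, F, T)
        h = self.proj(h.transpose(1, 2))           # (B, T, P)
        return torch.sigmoid(h).transpose(1, 2)    # (B, P, T)


def training_step(separator, transcriptor, mix_mag, score_rolls, harmonic_masks):
    """One weakly supervised step; no isolated sources are used.

    mix_mag:        (B, 1, F, T) mixture magnitude spectrogram
    score_rolls:    (B, I, P, T) float piano rolls from the aligned score
    harmonic_masks: (B, I, F, T) score-derived bins where each source may sound
    """
    masks = separator(mix_mag)                     # (B, I, F, T)
    estimates = masks * mix_mag                    # masked per-source estimates

    # Transcription (critic) loss: each separated estimate should transcribe
    # to its own instrument's piano roll, giving the separator supervision
    # in both time (note onsets/offsets) and frequency (pitch).
    transcription_loss = mix_mag.new_zeros(())
    for i in range(N_INSTRUMENTS):
        pred_roll = transcriptor(estimates[:, i : i + 1])   # (B, P, T)
        transcription_loss = transcription_loss + F.binary_cross_entropy(
            pred_roll, score_rolls[:, i]
        )

    # Harmonic mask constraint (one plausible form): penalize separator mask
    # energy in bins the score rules out for that instrument.
    harmonic_loss = (masks * (1.0 - harmonic_masks)).mean()

    # Mixture consistency: the per-source estimates should sum to the mixture.
    recon_loss = F.l1_loss(estimates.sum(dim=1, keepdim=True), mix_mag)

    return transcription_loss + harmonic_loss + recon_loss
```

In a setup like this, the gradient of the transcription loss flows through the transcriptor into the separator, so the aligned score supervises separation in both time and frequency without any isolated-source targets, while the harmonic mask term discourages the separator from assigning energy to bins the score rules out.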
