Spectrogram Feature Losses for Music Source Separation

In this paper we study deep learning-based music source separation and explore an alternative to the standard pixel-level L2 spectrogram loss for model training. Our main contribution is to demonstrate that adding a high-level feature loss term, extracted from the spectrograms with a VGG net, can improve separation quality compared with a purely pixel-level loss. We show this improvement in the context of MMDenseNet, a state-of-the-art deep learning model for this task, for extracting drums and vocals from songs in the musdb18 dataset, which covers a broad range of Western music genres. We believe that this finding generalizes and can be applied to a broader class of machine learning-based systems in the audio domain.
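To make the idea concrete, the minimal PyTorch sketch below combines a pixel-level L2 spectrogram loss with a VGG feature-space term. The choice of VGG16, the relu3_3 cut-off, the channel-repetition step, and the weight alpha are illustrative assumptions for this sketch, not the exact configuration used in the paper.

    # Sketch of a combined spectrogram loss: pixel-level L2 plus a VGG feature term.
    # Layer cut-off (relu3_3), channel handling, and alpha are illustrative choices;
    # ImageNet input normalization is omitted for brevity.
    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class SpectrogramFeatureLoss(nn.Module):
        def __init__(self, alpha=0.1, feature_layer=16):
            super().__init__()
            # Frozen VGG16 feature extractor; features[:16] ends at relu3_3.
            vgg = vgg16(weights="IMAGENET1K_V1").features[:feature_layer].eval()
            for p in vgg.parameters():
                p.requires_grad = False
            self.vgg = vgg
            self.alpha = alpha
            self.mse = nn.MSELoss()

        def _vgg_features(self, spec):
            # Spectrograms arrive as (batch, 1, freq, time); VGG expects 3 input
            # channels, so the single magnitude channel is repeated.
            return self.vgg(spec.repeat(1, 3, 1, 1))

        def forward(self, est_spec, ref_spec):
            pixel_loss = self.mse(est_spec, ref_spec)
            feature_loss = self.mse(self._vgg_features(est_spec),
                                    self._vgg_features(ref_spec))
            return pixel_loss + self.alpha * feature_loss

    # Usage: criterion = SpectrogramFeatureLoss()
    #        loss = criterion(model(mix_spec), target_spec)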

[1] Naoya Takahashi et al., "Multi-scale multi-band DenseNets for audio source separation," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.

[2] Antoine Liutkus et al., "The 2018 Signal Separation Evaluation Campaign," LVA/ICA, 2018.

[3] Rémi Gribonval et al., "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

[4] Andrew Zisserman et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR, 2014.

[5] Patrick Pérez et al., "Audio Style Transfer," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[6] Li Fei-Fei et al., "Perceptual Losses for Real-Time Style Transfer and Super-Resolution," ECCV, 2016.

[7] Julius O. Smith et al., "Neural Style Transfer for Audio Spectograms," arXiv, 2018.