Singing Voice Separation with Deep U-Net Convolutional Networks

The decomposition of a music audio signal into its vocal and backing-track components is analogous to image-to-image translation: a mixed spectrogram is transformed into its constituent sources. We propose a novel application of the U-Net architecture, initially developed for medical imaging, to the task of source separation, motivated by its proven capacity to recreate the fine, low-level detail required for high-quality audio reproduction. Experiments with both quantitative evaluation and subjective assessment demonstrate that the proposed algorithm achieves state-of-the-art performance.
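The spectrogram-masking paradigm described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_mask` is a hypothetical stand-in for the trained U-Net, and the reconstruction reuses the mixture phase, a common simplification in magnitude-domain separation.

```python
import numpy as np

def predict_mask(mag: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the U-Net: any function mapping a
    magnitude spectrogram to a soft mask of the same shape, valued in (0, 1).
    Here: a sigmoid of the mean-centered magnitudes."""
    return 1.0 / (1.0 + np.exp(-(mag - mag.mean())))

def separate(mixture_stft: np.ndarray) -> np.ndarray:
    """Estimate the vocal spectrogram by applying a soft mask to the
    mixture magnitude and reattaching the mixture phase."""
    mag = np.abs(mixture_stft)
    phase = np.angle(mixture_stft)
    mask = predict_mask(mag)             # element-wise soft mask in (0, 1)
    return (mask * mag) * np.exp(1j * phase)  # complex spectrogram of the estimate

# Toy complex "STFT" (513 frequency bins x 128 frames) to exercise the pipeline.
rng = np.random.default_rng(0)
mix = rng.standard_normal((513, 128)) + 1j * rng.standard_normal((513, 128))
vocal_est = separate(mix)

# A soft mask in (0, 1) can only attenuate, never amplify, the mixture.
assert vocal_est.shape == mix.shape
assert np.all(np.abs(vocal_est) <= np.abs(mix) + 1e-9)
```

In practice the estimated complex spectrogram would be inverted back to audio with an inverse STFT; because only the magnitude is modified, the perceptual quality of the result depends on how well the mixture phase suits the isolated source.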
