A Vocoder Based Method for Singing Voice Extraction

This paper presents a novel method for extracting the vocal track from a musical mixture. The mixture consists of a singing voice and a backing track, which may comprise various instruments. We use a convolutional network with skip and residual connections, as well as dilated convolutions, to estimate vocoder parameters from the spectrogram of an input mixture. The estimated parameters are then used to synthesize the vocal track, without any interference from the backing track. We evaluate our system through objective metrics pertinent to audio quality and interference from background sources, and through a comparative subjective evaluation. As benchmarks, we use open-source source separation systems based on Non-negative Matrix Factorization (NMF) and deep learning methods, and we discuss future applications of this algorithm.
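The mapping from mixture spectrogram to vocoder parameters can be illustrated with a minimal NumPy sketch of stacked dilated convolutions with residual and skip connections. This is not the authors' network: the weights are random and untrained, and the layer sizes, the `tanh` nonlinearity, the doubling dilation schedule, the number of output parameters, and all function names are assumptions chosen only for illustration.

```python
import numpy as np


def dilated_conv1d(x, w, dilation):
    """Zero-padded ('same') dilated 1-D convolution along the time axis.

    x: (frames, in_channels); w: (kernel, in_channels, out_channels).
    Assumes an odd kernel size so the padding is symmetric.
    """
    kernel = w.shape[0]
    pad = dilation * (kernel - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    frames = x.shape[0]
    out = np.zeros((frames, w.shape[2]))
    for k in range(kernel):
        start = k * dilation  # taps are spaced `dilation` frames apart
        out += xp[start:start + frames] @ w[k]
    return out


def residual_block(x, w_conv, w_res, w_skip, dilation):
    """One block: dilated conv, then a residual path and a skip path."""
    h = np.tanh(dilated_conv1d(x, w_conv, dilation))
    residual = x + h @ w_res  # residual connection feeds the next block
    skip = h @ w_skip         # skip connection feeds the output layer
    return residual, skip


def estimate_vocoder_params(spec, n_blocks=4, channels=16, n_params=8, seed=0):
    """Map a (frames, bins) spectrogram to (frames, n_params) features.

    All sizes are hypothetical; weights are random, so the output is
    meaningless until such a network is trained.
    """
    rng = np.random.default_rng(seed)
    x = spec @ rng.normal(0, 0.1, (spec.shape[1], channels))
    skips = np.zeros((spec.shape[0], channels))
    for i in range(n_blocks):
        w_conv = rng.normal(0, 0.1, (3, channels, channels))
        w_res = rng.normal(0, 0.1, (channels, channels))
        w_skip = rng.normal(0, 0.1, (channels, channels))
        x, s = residual_block(x, w_conv, w_res, w_skip, dilation=2 ** i)
        skips += s  # sum the skip outputs of all blocks
    w_out = rng.normal(0, 0.1, (channels, n_params))
    return np.maximum(skips, 0) @ w_out


def receptive_field(n_blocks, kernel=3):
    """Frames seen by one output frame when dilations double per block."""
    return 1 + sum((kernel - 1) * 2 ** i for i in range(n_blocks))
```

Under these assumptions, doubling the dilation in each block lets the receptive field grow exponentially with depth while the parameter count grows only linearly, which is the usual motivation for dilated convolutions. The per-frame output would then drive a vocoder to resynthesize the voice.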
