Dilated Convolution with Dilated GRU for Music Source Separation

Stacked dilated convolutions, as used in WaveNet, have proven effective for generating high-quality audio. By replacing pooling/striding with dilation, convolution layers preserve high-resolution information while still reaching distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is to separate different sound sources while maintaining the quality of the separated sounds. This paper therefore investigates stacked dilated convolutions as the backbone for music source separation. However, although stacked dilated convolutions cover a wider context than standard convolutions, their effective receptive fields are still fixed and may not be wide enough for complex music audio signals. To reach information at remote locations, we propose combining dilated convolution with a modified gated recurrent unit (GRU), called the 'Dilated GRU', to form a block. For a fixed k, a Dilated GRU unit receives its recurrent input from k steps before instead of from the previous step. This modification lets a GRU unit reach a given location in fewer recurrent steps and run faster, because the recurrence can execute partially in parallel. We show that the proposed model, built from a stack of such blocks, performs as well as or better than state-of-the-art models at separating vocals and accompaniment.
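The core idea of the Dilated GRU, a recurrence that reads the hidden state from k steps back, can be illustrated with a minimal sketch. The code below is not the paper's implementation; it is a hypothetical NumPy toy (the class and function names are our own) showing that with dilation k the sequence decomposes into k independent sub-sequences, which is what allows partial parallelism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalGRUCell:
    """Bare-bones GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_size, input_size + hidden_size)
        self.Wz = rng.normal(0.0, 0.1, shape)
        self.Wr = rng.normal(0.0, 0.1, shape)
        self.Wh = rng.normal(0.0, 0.1, shape)

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h_prev]))
        return (1.0 - z) * h_prev + z * h_cand

def dilated_gru(cell, xs, dilation, hidden_size):
    """Run a recurrence where step t reads the hidden state of step
    t - dilation (zeros when that index is negative). For dilation k the
    sequence splits into k sub-sequences with no cross-dependencies."""
    T = len(xs)
    hs = np.zeros((T, hidden_size))
    for t in range(T):
        h_prev = hs[t - dilation] if t - dilation >= 0 else np.zeros(hidden_size)
        hs[t] = cell.step(xs[t], h_prev)
    return hs
```

Because the even- and odd-indexed steps never exchange state when dilation is 2, running the cell on the even-indexed inputs alone reproduces the even-indexed hidden states of the full run, so the k strands can be computed concurrently.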
