ATT: Attention-based Timbre Transfer

In this paper, we tackle the problem of timbre transfer for a given monophonic music sample. The objective is to change the timbre of source audio from one instrument to another while preserving features such as loudness, pitch, and rhythm. Existing approaches apply image-to-image translation techniques to the entire time-frequency representation of the raw audio waveform, which can introduce unwanted artifacts into the final audio. We propose Attention-based Timbre Transfer (ATT), an attention-based pipeline for transferring timbre. To the best of our knowledge, ATT is the first approach that leverages attention for timbre transfer. Further, ATT uses MelGAN for spectrogram inversion, which provides a fast, parallel alternative to autoregressive music generation approaches without compromising quality. ATT shows promising results, effectively transferring timbre with minimal change to other physical characteristics of the audio.
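The core idea behind attention-guided translation, as used in attention-based image-to-image methods, is to modify only the regions a learned attention mask highlights while copying the rest of the input through unchanged. The sketch below illustrates this blending step on a mel-spectrogram-shaped array; it is a minimal illustration, not the paper's implementation, and the function name, array sizes, and the hand-built mask are all hypothetical stand-ins for the learned translator and attention network.

```python
import numpy as np

def attention_blend(source_spec, translated_spec, attention_mask):
    """Attention-guided combination of two spectrograms.

    Where the mask is 1, take the translated (target-timbre) content;
    where it is 0, keep the source spectrogram untouched. In ATT-style
    pipelines both the translation and the mask are produced by learned
    networks; here they are plain arrays for illustration.
    """
    return attention_mask * translated_spec + (1.0 - attention_mask) * source_spec

# Toy example: 80 mel bins x 100 time frames (hypothetical sizes).
rng = np.random.default_rng(0)
src = rng.random((80, 100))    # source-instrument spectrogram
trn = rng.random((80, 100))    # fully translated spectrogram
mask = np.zeros((80, 100))     # attend only to a band of mel bins
mask[20:60, :] = 1.0

out = attention_blend(src, trn, mask)
```

Because unattended regions pass through verbatim, this kind of masking limits the "addition of unwanted elements" that whole-spectrogram translation can cause; the blended spectrogram would then be inverted back to a waveform by a vocoder such as MelGAN.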
