Multi-instrument Music Synthesis with Spectrogram Diffusion

An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models, which offer detailed control but only for specific instruments, and raw waveform models, which can train on any music but offer minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
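As a rough illustration of the two-stage pipeline described above, the sketch below samples a spectrogram from a MIDI-conditioned diffusion decoder via standard DDPM ancestral sampling (Ho et al., 2020) and then inverts it to audio. This is a minimal sketch under stated assumptions, not the paper's implementation: `encode_midi`, `denoiser`, and `vocoder` are hypothetical placeholders for the Transformer encoder, the diffusion decoder, and the GAN spectrogram inverter, and the linear noise schedule is an illustrative default.

```python
import numpy as np

# Illustrative linear noise schedule (Ho et al., 2020); the paper's actual
# schedule and step count may differ.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def synthesize(encode_midi, denoiser, vocoder, midi_tokens, spec_shape, rng):
    """Stage 1: sample a spectrogram from the DDPM decoder, conditioned on
    encoded MIDI. Stage 2: invert the spectrogram to audio with the vocoder.
    All three callables are hypothetical placeholders."""
    context = encode_midi(midi_tokens)       # Transformer encoder activations
    x = rng.standard_normal(spec_shape)      # start from pure Gaussian noise
    for t in reversed(range(T)):             # DDPM ancestral sampling
        eps_hat = denoiser(x, t, context)    # predict the injected noise
        a, a_bar = alphas[t], alphas_cumprod[t]
        # Posterior mean of x_{t-1} given the noise prediction.
        x = (x - (1.0 - a) / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a)
        if t > 0:                            # add sampling noise except at t=0
            x += np.sqrt(betas[t]) * rng.standard_normal(spec_shape)
    return vocoder(x)                        # spectrogram -> waveform samples
```

For training, the sampling loop would be replaced by the usual epsilon-prediction objective: corrupt a clean target spectrogram with Gaussian noise at a uniformly sampled timestep and regress the decoder's output against that noise with an MSE loss, while the autoregressive baseline instead predicts spectrogram frames sequentially.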
