Neural Waveshaping Synthesis

We present the Neural Waveshaping Unit (NEWT): a novel, lightweight, fully causal approach to neural audio synthesis that operates directly in the waveform domain, with an accompanying optimisation (FastNEWT) for efficient CPU inference. The NEWT uses time-distributed multilayer perceptrons with periodic activations to implicitly learn nonlinear transfer functions that encode the characteristics of a target timbre. Once trained, a NEWT can produce complex timbral evolutions by simple affine transformations of its input and output signals. We paired the NEWT with a differentiable noise synthesiser and reverb and found it capable of generating realistic musical instrument performances with only 260k total model parameters, conditioned on F0 and loudness features. We compared our method to state-of-the-art benchmarks with a multi-stimulus listening test and the Fréchet Audio Distance and found it performed competitively across the tested timbral domains. Our method significantly outperformed the benchmarks in terms of generation speed, and achieved real-time performance on a consumer CPU, both with and without FastNEWT, suggesting it is a viable basis for future creative sound design tools.
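The core mechanism described above, passing an exciter signal through a nonlinear transfer function, with affine transforms on the shaper's input and output controlling timbral evolution, can be illustrated with a minimal NumPy sketch. This is not the trained NEWT model: the MLP weights here are random placeholders standing in for a learned shaper, and all function names are hypothetical.

```python
import numpy as np

def waveshape(x, transfer_fn, in_gain=1.0, in_bias=0.0,
              out_gain=1.0, out_bias=0.0):
    """Classic waveshaping: apply a nonlinear transfer function to an
    exciter signal, with affine transforms on input and output (the
    same control mechanism the abstract attributes to the NEWT)."""
    return out_gain * transfer_fn(in_gain * x + in_bias) + out_bias

# A tiny sample-wise MLP with sine (periodic) activations stands in
# for a trained shaper; its weights are random, not learned.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(1, 16)), rng.normal(size=16)
w2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def mlp_shaper(x):
    h = np.sin(x[:, None] @ w1 + b1)   # periodic activation
    return (h @ w2 + b2)[:, 0]

sr, f0 = 16000, 220.0
t = np.arange(sr) / sr
exciter = np.sin(2 * np.pi * f0 * t)   # sinusoidal exciter at F0
audio = waveshape(exciter, mlp_shaper, in_gain=0.8)
```

Because the shaper is nonlinear, the output contains harmonics absent from the sinusoidal input; sweeping `in_gain` over time changes which harmonics are emphasised, which is the sense in which simple affine transformations yield timbral evolution.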
