Efficient Neural Networks for Real-time Analog Audio Effect Modeling

Deep learning approaches have demonstrated success in modeling analog audio effects such as distortion and overdrive. Nevertheless, challenges remain in modeling more complex effects, such as dynamic range compressors, along with their variable parameters. Previous methods are computationally complex and noncausal, which prohibits real-time operation, a critical requirement in audio production contexts. They also rely on large training datasets, which are time-intensive to generate. In this work, we demonstrate that shallower temporal convolutional networks (TCNs) that exploit very large dilation factors to obtain a large receptive field can achieve state-of-the-art performance while remaining efficient. Not only are these models found to be perceptually similar to the original effect, they also achieve a 4x speedup, enabling real-time operation on CPU, and can be trained using only 1% of the data used by previous methods.
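To illustrate why large dilation factors let a shallow network cover a long stretch of audio, the sketch below computes the receptive field of a stack of dilated 1-D convolutions using the standard formula: 1 + (kernel size - 1) x (sum of dilations). The kernel sizes, layer counts, dilation growth rates, and the 44.1 kHz sample rate are illustrative assumptions, not values stated in the abstract.

```python
# Receptive field of a stack of dilated 1-D convolutions (stride 1).
# All configuration values below are assumptions chosen for illustration;
# the abstract does not specify the architecture's hyperparameters.

def dilation_schedule(num_layers, growth):
    """Dilation for each layer: growth**0, growth**1, ..."""
    return [growth ** i for i in range(num_layers)]

def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

SAMPLE_RATE = 44100  # Hz, assumed

# Hypothetical shallow network: 4 layers, dilation grows 10x per layer.
shallow = receptive_field(kernel_size=13, dilations=dilation_schedule(4, 10))

# Hypothetical deeper network: 10 layers, dilation grows 2x per layer.
deep = receptive_field(kernel_size=3, dilations=dilation_schedule(10, 2))

shallow_ms = 1000 * shallow / SAMPLE_RATE
deep_ms = 1000 * deep / SAMPLE_RATE

print(f"shallow: {shallow} samples ({shallow_ms:.1f} ms)")
print(f"deep:    {deep} samples ({deep_ms:.1f} ms)")
```

Under these assumed settings, the 4-layer network sees several times more context than the 10-layer one, which is the trade-off the abstract points to: fewer layers, hence less computation per sample, without giving up the long temporal context that an effect such as a compressor requires.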
