A general-purpose deep learning approach to model time-varying audio effects

Audio processors whose parameters are modified periodically over time are often referred to as time-varying or modulation-based audio effects. Most existing methods for modeling this type of effect unit are optimized for a very specific circuit and cannot be efficiently generalized to other time-varying effects. Based on convolutional and recurrent neural networks, we propose a deep learning architecture for generic black-box modeling of audio processors with long-term memory. We explore the capabilities of deep neural networks to learn such long temporal dependencies, and we show that the network can model various linear and nonlinear, time-varying and time-invariant audio effects. To measure the performance of the model, we propose an objective metric based on the psychoacoustics of modulation frequency perception. We also analyze what the model is actually learning and how the given task is accomplished.
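
To make the idea of a modulation-based objective metric more concrete, the following is a minimal sketch of one possible comparison in the modulation-frequency domain. It is not the metric proposed in this work and includes no psychoacoustic weighting. The function names, the frame-wise RMS envelope, the 20 Hz modulation ceiling, and the mean absolute difference are all illustrative assumptions.

```python
import numpy as np

def modulation_spectrum(x, sr, frame=256, hop=128, max_mod_hz=20.0):
    """Magnitude spectrum of the frame-wise RMS envelope of x (hypothetical helper)."""
    n_frames = 1 + (len(x) - frame) // hop
    # Index matrix that slices x into overlapping frames.
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    env = np.sqrt(np.mean(x[idx] ** 2, axis=1))   # amplitude envelope
    env = env - env.mean()                        # remove the DC component
    env_sr = sr / hop                             # envelope sampling rate
    spec = np.abs(np.fft.rfft(env * np.hanning(n_frames)))
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / env_sr)
    keep = freqs <= max_mod_hz                    # assumed range of interest
    return freqs[keep], spec[keep]

def modulation_distance(target, output, sr):
    """Mean absolute difference between two modulation spectra (assumed distance)."""
    _, a = modulation_spectrum(target, sr)
    _, b = modulation_spectrum(output, sr)
    return float(np.mean(np.abs(a - b)))

# Synthetic tremolo signals: a 1 kHz carrier with 5 Hz and 2 Hz amplitude modulation.
sr = 16000
t = np.arange(2 * sr) / sr
carrier = np.sin(2 * np.pi * 1000 * t)
tremolo_5hz = (1 + 0.5 * np.sin(2 * np.pi * 5 * t)) * carrier
tremolo_2hz = (1 + 0.5 * np.sin(2 * np.pi * 2 * t)) * carrier

freqs, spec = modulation_spectrum(tremolo_5hz, sr)
peak_hz = freqs[np.argmax(spec)]                  # dominant modulation frequency
d_same = modulation_distance(tremolo_5hz, tremolo_5hz, sr)
d_diff = modulation_distance(tremolo_5hz, tremolo_2hz, sr)
```

In this sketch, a model output whose modulation rate differs from the target yields a larger distance than an output that matches it, which is the kind of behavior a waveform-level error alone does not isolate.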
