Basic filters for convolutional neural networks applied to music: Training or design?

When convolutional neural networks are applied to learning problems on music or other time series, the raw one-dimensional data are commonly preprocessed into spectrogram or mel-spectrogram coefficients, which then serve as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this preprocessing step on the network’s performance, and ask whether replacing it with adaptive or learned filters applied directly to the raw data can improve learning success. The theoretical results show that mel-spectrogram coefficients can, in principle, be approximately reproduced by applying adaptive filters and then time-averaging the squared amplitudes. We also conducted extensive experiments on the task of singing voice detection in music. These experiments show that, for classification based on convolutional neural networks, features obtained from adaptive filter banks followed by time-averaging the squared modulus of the filters’ output perform better than the canonical Fourier transform-based mel-spectrogram coefficients. Alternative adaptive approaches, with center frequencies or time-averaging lengths learned from training data, perform equally well.
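The feature pipeline named in the abstract (filter the raw signal, take the squared modulus, then time-average) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Hann-windowed complex exponential filters, the box-shaped averaging window, the three center frequencies, and the names `gabor_filter` and `filterbank_features` are all assumptions made for illustration, and nothing here is learned or mel-spaced.

```python
import numpy as np

def gabor_filter(center_freq, sample_rate, length):
    """Hypothetical filter: Hann-windowed complex exponential at center_freq (Hz)."""
    t = np.arange(length) - (length - 1) / 2
    window = np.hanning(length)
    window = window / window.sum()  # unit gain at the center frequency
    return window * np.exp(2j * np.pi * center_freq * t / sample_rate)

def filterbank_features(signal, filters, avg_len, hop):
    """Filter, take the squared modulus, then time-average and subsample."""
    box = np.ones(avg_len) / avg_len  # assumed averaging window (plain mean)
    rows = []
    for h in filters:
        out = np.convolve(signal, h, mode="same")        # filter the raw data
        energy = np.abs(out) ** 2                        # squared modulus
        smoothed = np.convolve(energy, box, mode="same") # time-averaging
        rows.append(smoothed[::hop])                     # one value per hop
    return np.array(rows)

# Toy input: one second of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

# Assumed center frequencies; a mel-like bank would space these on the mel scale.
center_freqs = [220.0, 440.0, 880.0]
filters = [gabor_filter(f, sample_rate, 256) for f in center_freqs]

features = filterbank_features(signal, filters, avg_len=200, hop=100)
print(features.shape)  # (number of filters, number of time frames)
```

In this sketch the row belonging to the 440 Hz filter carries most of the energy, which is the behavior a spectrogram-like representation should show for a pure tone. Making `center_freqs` or `avg_len` trainable parameters would correspond to the learned variants the abstract mentions, but that step is not shown here.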
