Towards Explainable Convolutional Features for Music Audio Modeling

Audio signals are often represented as spectrograms and treated as 2D images. As a result, deep convolutional architectures are widely used for music audio tasks, even though spectrograms and natural images have very different structures. In this work, we attempt to “open the black box” of deep convolutional models, both to inform future architectures for music audio tasks and to explain the excellent performance of deep convolutions that model spectrograms as 2D images. To this end, we extend recent explainability discussions in deep learning from natural image data to music audio data, through systematic experiments on the deep features learned by various convolutional architectures. We demonstrate that deep convolutional features perform well across various target tasks, whether or not they are extracted from deep architectures originally trained on that task. Additionally, deep features exhibit high similarity to hand-crafted wavelet features, whether they are extracted from a trained or an untrained model.
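The abstract does not say which similarity measure is used to compare deep features with wavelet features. As a minimal sketch of how such a comparison could be set up, the snippet below assumes linear centered kernel alignment (CKA), one common choice for comparing two representations of the same examples. The function name `linear_cka` and the random arrays standing in for deep and wavelet features are hypothetical placeholders, not data or code from this work.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between X (n, p) and Y (n, q); rows are the same n examples."""
    # Center each feature dimension so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


rng = np.random.default_rng(0)

# Hypothetical stand-in for deep features: 200 examples, 64 dimensions.
deep = rng.standard_normal((200, 64))

# Hypothetical stand-in for hand-crafted features: a rotated, rescaled copy
# of `deep`, so the two representations carry the same information.
rotation, _ = np.linalg.qr(rng.standard_normal((64, 64)))
wavelet_like = 3.0 * deep @ rotation

# Independent random features, used only as a low-similarity baseline.
unrelated = rng.standard_normal((200, 64))

cka_related = linear_cka(deep, wavelet_like)
cka_unrelated = linear_cka(deep, unrelated)
print(f"related: {cka_related:.3f}, unrelated: {cka_unrelated:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why the rotated and rescaled copy scores essentially 1 while the independent features score lower. In practice, the rows of `X` and `Y` would be features extracted from the same audio excerpts by the two methods being compared.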
