Reliable Local Explanations for Machine Listening

One way to analyse the behaviour of machine learning models is through local explanations that highlight the input features that most influence model predictions. Sensitivity analysis, which examines the effect of input perturbations on model predictions, is one method for generating local explanations. Meaningful input perturbations are essential for generating reliable explanations, yet there is limited work on what such perturbations are and how to perform them. This work investigates these questions in the context of machine listening models that analyse audio. Specifically, we use a state-of-the-art deep singing voice detection (SVD) model to analyse whether explanations from SoundLIME (a local explanation method) are sensitive to how the method perturbs model inputs. The results demonstrate that SoundLIME explanations are sensitive to the content in the occluded input regions. We further propose and demonstrate a novel method for quantitatively identifying suitable content type(s) for reliably occluding inputs of machine listening models. The results for the SVD model suggest that the average magnitude of the input mel-spectrogram bins is the most suitable content type for temporal explanations.
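To make the perturbation setup concrete, below is a minimal Python sketch of occlusion-based sensitivity analysis on a mel-spectrogram input, in the spirit of SoundLIME's temporal segmentation. The prediction function, segment width, and toy input are hypothetical placeholders rather than the paper's actual SVD network, and the particular fill types compared (zeros, noise, average bin magnitude) are assumptions used only to illustrate how different occlusion content can change the measured sensitivity.

```python
import numpy as np

# Hypothetical stand-in for a frame-level singing voice detector;
# a real model would map a (n_mels, n_frames) mel-spectrogram to a
# probability that singing voice is present.
def predict_voice_probability(mel_spectrogram):
    return float(mel_spectrogram.mean() > 0.5)  # placeholder only

def occlude_temporal_segment(mel, start, end, fill="mean"):
    """Return a copy of `mel` with frames [start, end) replaced by `fill` content."""
    perturbed = mel.copy()
    if fill == "zero":
        perturbed[:, start:end] = 0.0
    elif fill == "noise":
        perturbed[:, start:end] = np.random.uniform(
            mel.min(), mel.max(), size=perturbed[:, start:end].shape)
    elif fill == "mean":
        # Average magnitude of the input mel-spectrogram bins, the content
        # type the abstract identifies as most suitable for temporal explanations.
        perturbed[:, start:end] = mel.mean()
    return perturbed

# Measure how much occluding each temporal segment changes the prediction,
# for each candidate occlusion content type.
mel = np.random.rand(80, 100)  # toy (n_mels, n_frames) input
baseline = predict_voice_probability(mel)
segment_width = 10
for start in range(0, mel.shape[1], segment_width):
    for fill in ("zero", "noise", "mean"):
        p = predict_voice_probability(
            occlude_temporal_segment(mel, start, start + segment_width, fill))
        print(f"frames {start}-{start + segment_width}, fill={fill}: delta={baseline - p:+.3f}")
```

The point of the sketch is that the attribution assigned to a temporal segment depends on what the occlusion writes into it, which is the sensitivity the paper quantifies when selecting a reliable content type.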
