Cross-Representation Transferability of Adversarial Attacks: From Spectrograms to Audio Waveforms

This paper shows the susceptibility of spectrogram-based audio classifiers to adversarial attacks and the transferability of such attacks to audio waveforms. Several adversarial attacks commonly used on images are applied to Mel-frequency and short-time Fourier transform spectrograms, and the perturbed spectrograms are able to fool a 2D convolutional neural network (CNN). Such attacks produce perturbed spectrograms whose perturbations are visually imperceptible to humans. Furthermore, the audio waveforms reconstructed from the perturbed spectrograms are also able to fool a 1D CNN trained on the original audio. Experimental results on a dataset of Western music show that the 2D CNN achieves a mean accuracy of up to 81.87% on legitimate examples, which drops to 12.09% on adversarial examples. Likewise, the 1D CNN achieves a mean accuracy of up to 78.29% on original audio samples, which drops to 27.91% on adversarial audio waveforms reconstructed from the perturbed spectrograms.
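The pipeline summarized above (perturb a spectrogram, then reconstruct a waveform from it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's setup: the attack is a single fast-gradient-sign step, the classifier is a toy logistic model with random weights standing in for the 2D CNN, the spectrogram uses non-overlapping rectangular frames so that inversion is exact, and the waveform is rebuilt by reusing the original phase. The names `to_spectrogram`, `to_waveform`, `bce_loss`, and `fgsm` are hypothetical helpers.

```python
import numpy as np


def to_spectrogram(wave, frame_len):
    """Split into non-overlapping frames; return (magnitude, phase) per frame."""
    n_frames = len(wave) // frame_len
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)


def to_waveform(mag, phase, frame_len):
    """Invert each frame from magnitude and phase, then concatenate the frames."""
    spec = mag * np.exp(1j * phase)
    return np.fft.irfft(spec, n=frame_len, axis=1).reshape(-1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(mag, w, b, y):
    """Binary cross-entropy of the toy logistic classifier on a spectrogram."""
    p = sigmoid(np.sum(w * mag) + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def fgsm(mag, w, b, y, eps):
    """One fast-gradient-sign step on the magnitude spectrogram."""
    p = sigmoid(np.sum(w * mag) + b)
    grad = (p - y) * w                # gradient of the loss w.r.t. the input
    adv = mag + eps * np.sign(grad)   # move every bin by eps in the loss-increasing direction
    return np.clip(adv, 0.0, None)    # magnitudes cannot be negative


# Toy data (assumed): a noisy sine wave and a randomly weighted classifier.
rng = np.random.default_rng(0)
frame_len = 64
t = np.arange(8 * frame_len)
wave = np.sin(2 * np.pi * 5 * t / frame_len) + 0.1 * rng.standard_normal(t.size)

mag, phase = to_spectrogram(wave, frame_len)
w = rng.normal(scale=0.01, size=mag.shape)
b, y, eps = 0.0, 1.0, 0.5

mag_adv = fgsm(mag, w, b, y, eps)
wave_adv = to_waveform(mag_adv, phase, frame_len)  # reuse the original phase

loss_clean = bce_loss(mag, w, b, y)
loss_adv = bce_loss(mag_adv, w, b, y)
mag_back, _ = to_spectrogram(wave_adv, frame_len)

print(f"loss on clean spectrogram:     {loss_clean:.4f}")
print(f"loss on perturbed spectrogram: {loss_adv:.4f}")
print("perturbation survives reconstruction:", np.allclose(mag_back, mag_adv, atol=1e-8))
```

In this simplified setting the perturbation survives the round trip exactly, because non-overlapping frames make every magnitude and phase pair a valid spectrogram. With overlapping windows or Mel-frequency spectrograms, reconstruction is only approximate, so whether an attack still fools a waveform-based model has to be measured, as the reported 1D CNN results do.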
