Randomly Weighted CNNs for (Music) Audio Classification

The computer vision literature shows that randomly weighted neural networks perform reasonably well as feature extractors. Following this idea, we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks. We use features extracted from the embeddings of deep architectures as input to a classifier, with the goal of comparing classification accuracies across different randomly weighted architectures. Following this methodology, we run a comprehensive evaluation of current architectures for audio classification, and provide evidence that the architecture alone is an important piece for resolving (music) audio problems with deep neural networks.
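The core of the methodology can be illustrated with a minimal sketch: a single randomly weighted 1-D convolutional layer (never trained) acts as a fixed feature extractor over a raw waveform, and its pooled activations become the input to a downstream classifier. The layer sizes, initialization, and pooling choice below are illustrative assumptions for the sketch, not the specific architectures evaluated in the paper.

```python
import numpy as np

def random_conv_features(x, n_filters=16, kernel_size=64, seed=0):
    """Embed a 1-D waveform with a single randomly weighted conv layer.

    The weights are drawn once from a He-style Gaussian and never trained;
    ReLU plus global average pooling turns the activations into a fixed-size
    feature vector. All hyperparameters here are illustrative.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(2.0 / kernel_size), size=(n_filters, kernel_size))
    feats = []
    for k in range(n_filters):
        act = np.convolve(x, w[k], mode="valid")  # 1-D convolution
        act = np.maximum(act, 0.0)                # ReLU non-linearity
        feats.append(act.mean())                  # global average pooling
    return np.array(feats)

# Toy usage: two synthetic "classes" (low- vs high-frequency tones)
t = np.arange(4096) / 4096.0
low = random_conv_features(np.sin(2 * np.pi * 50 * t))
high = random_conv_features(np.sin(2 * np.pi * 400 * t))
# `low` and `high` are 16-dimensional embeddings that a separate
# classifier (e.g., an SVM or ELM, as in the paper) would consume.
```

Because the seed fixes the weights, the extractor is deterministic, which is what allows the paper's comparison to attribute accuracy differences to the architecture rather than to training.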
