Pre-trained Deep Neural Network using Sparse Autoencoders and Scattering Wavelet Transform for Musical Genre Recognition

Research described in this paper combines Deep Neural Networks (DNN) with novel audio features extracted using the Scattering Wavelet Transform (SWT) to classify musical genres. The SWT applies a cascade of wavelet transforms to compute modulation spectrum coefficients of multiple orders, an approach that has already been shown to be promising for this task. The DNN in this work uses layers pre-trained with Sparse Autoencoders (SAE). Data obtained from the Creative Commons website jamendo.com is used to augment the well-known GTZAN database, a standard benchmark for this task. The final classifier is evaluated with 10-fold cross-validation and achieves results comparable to other state-of-the-art approaches.
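The unsupervised pre-training step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the network sizes, hyperparameters, and the random matrix standing in for SWT feature vectors are all hypothetical. It shows the core idea of a sparse autoencoder — a one-hidden-layer autoencoder trained to reconstruct its input under a KL-divergence penalty that pushes the mean hidden activation toward a small target sparsity, after which the encoder weights initialize one DNN layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_sae(X, n_hidden, lr=0.1, rho=0.05, beta=0.1, epochs=50):
    """Train a one-hidden-layer autoencoder with a KL sparsity penalty
    on the mean hidden activation; return the encoder parameters."""
    n_in = X.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # decoder weights
    b2 = np.zeros(n_in)
    m = X.shape[0]
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden activations (encode)
        Xr = sigmoid(H @ W2 + b2)          # reconstruction (decode)
        rho_hat = H.mean(axis=0)           # mean activation per hidden unit
        # gradient of the squared reconstruction error at the output
        d_out = (Xr - X) * Xr * (1.0 - Xr)
        # gradient of beta * KL(rho || rho_hat) w.r.t. the hidden activations
        sparse_grad = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
        d_hid = (d_out @ W2.T + sparse_grad) * H * (1.0 - H)
        # averaged gradient-descent updates
        W2 -= lr * H.T @ d_out / m
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hid / m
        b1 -= lr * d_hid.mean(axis=0)
    return W1, b1

# Random vectors stand in for SWT feature frames (purely illustrative).
X = rng.random((200, 40))
W1, b1 = pretrain_sae(X, n_hidden=16)
features = sigmoid(X @ W1 + b1)            # output of the pre-trained layer
print(features.shape)                      # (200, 16)
```

In a stacked setup, `features` would feed the next autoencoder (greedy layer-wise pre-training), and the resulting encoder weights would initialize the DNN before supervised fine-tuning on the genre labels.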
