Masked Conditional Neural Networks for Audio Classification

We present the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN), designed for temporal signal recognition. The CLNN accounts for the temporal nature of the sound signal, while the MCLNN extends the CLNN with a binary mask that preserves the spatial locality of the features and enables an automated exploration of feature combinations, analogous to hand-crafting the most relevant features for the recognition task. The MCLNN achieves competitive recognition accuracies on the GTZAN and ISMIR2004 music datasets, surpassing several state-of-the-art neural-network-based architectures and hand-crafted methods applied to both datasets.
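
To make the construction concrete, below is a minimal NumPy sketch of a single masked conditional step as described above: each hidden activation is conditioned on a window of 2n+1 neighbouring spectrogram frames, and a shared binary band mask restricts every hidden unit to a contiguous subset of feature dimensions, preserving spatial locality. The function names (`make_band_mask`, `mclnn_layer`), the ReLU nonlinearity, and the bandwidth/overlap mask construction are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def make_band_mask(feat_dim, hidden_dim, bandwidth, overlap):
    """Binary mask with diagonal bands: each hidden unit only sees a
    contiguous subset of input features, preserving spatial locality.
    (Illustrative construction; the paper's exact indexing may differ.)"""
    mask = np.zeros((feat_dim, hidden_dim))
    step = bandwidth - overlap  # band shift between consecutive hidden units
    for j in range(hidden_dim):
        start = (j * step) % feat_dim
        idx = (start + np.arange(bandwidth)) % feat_dim
        mask[idx, j] = 1.0
    return mask

def mclnn_layer(frames, weights, mask, bias):
    """One hypothetical masked conditional step: the activation for the
    centre frame is conditioned on a window of 2n+1 neighbouring frames,
    each with its own weight matrix, all sharing one binary mask.
    frames:  (2n+1, feat_dim)            window of spectrogram frames
    weights: (2n+1, feat_dim, hidden_dim) per-offset weight matrices
    mask:    (feat_dim, hidden_dim)       shared binary mask
    """
    pre = bias.copy()
    for frame, w in zip(frames, weights):
        pre += frame @ (w * mask)   # element-wise mask enforces locality
    return np.maximum(pre, 0.0)     # ReLU-style nonlinearity (assumed)

# Tiny usage example with random data
rng = np.random.default_rng(0)
n, feat_dim, hidden_dim = 2, 40, 30
frames = rng.standard_normal((2 * n + 1, feat_dim))
weights = rng.standard_normal((2 * n + 1, feat_dim, hidden_dim)) * 0.1
mask = make_band_mask(feat_dim, hidden_dim, bandwidth=5, overlap=3)
out = mclnn_layer(frames, weights, mask, np.zeros(hidden_dim))
print(out.shape)  # (30,)
```

Stacking several such layers and mean-pooling the resulting frame activations before a softmax classifier would give a rough analogue of the full recognition pipeline the abstract describes.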
