Music Classification using an Improved CRNN with Multi-Directional Spatial Dependencies in Both Time and Frequency Dimensions

In music classification tasks, Convolutional Recurrent Neural Networks (CRNNs) have achieved state-of-the-art performance on several datasets. However, the current CRNN technique uses the RNN only to extract the spatial dependencies of a music signal in its time dimension, not its frequency dimension. We hypothesize that the latter can be additionally exploited to improve classification performance. In this paper, we propose an improved technique, CRNN in Time and Frequency dimensions (CRNN-TF), which captures the spatial dependencies of a music signal in both the time and frequency dimensions, in multiple directions. Experimental studies on three real-world music datasets show that CRNN-TF consistently outperforms CRNN and several other state-of-the-art deep learning-based music classifiers. Our results also suggest that CRNN-TF transfers well to small music datasets via fine-tuning.
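The core idea can be illustrated with a minimal sketch: scan a CNN-derived spectrogram feature map with recurrent passes along the time axis (forward and backward) and along the frequency axis (forward and backward), then concatenate the resulting states. This is not the authors' implementation; the simple tanh RNN cell, the random parameters, and the hidden size are illustrative assumptions.

```python
import numpy as np

def rnn_scan(X, Wx, Wh, b):
    # X: (steps, features); simple tanh RNN, returns the final hidden state.
    h = np.zeros(Wh.shape[0])
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def crnn_tf_features(feat_map, hidden=8, seed=0):
    # feat_map: (time, freq) feature map, assumed to come from a CNN front end.
    rng = np.random.default_rng(seed)
    T, F = feat_map.shape

    def params(in_dim):
        # Randomly initialized weights stand in for learned parameters.
        return (rng.normal(0.0, 0.1, (hidden, in_dim)),
                rng.normal(0.0, 0.1, (hidden, hidden)),
                np.zeros(hidden))

    # Time dimension: scan across time steps in both directions.
    Wt = params(F)
    h_time_fwd = rnn_scan(feat_map, *Wt)
    h_time_bwd = rnn_scan(feat_map[::-1], *Wt)

    # Frequency dimension: scan across frequency bins in both directions.
    Wf = params(T)
    h_freq_fwd = rnn_scan(feat_map.T, *Wf)
    h_freq_bwd = rnn_scan(feat_map.T[::-1], *Wf)

    # Concatenate the four directional summaries into one feature vector,
    # which a classifier head would consume.
    return np.concatenate([h_time_fwd, h_time_bwd, h_freq_fwd, h_freq_bwd])

feat = crnn_tf_features(np.random.default_rng(1).normal(size=(32, 16)))
```

With `hidden=8`, the four directional scans yield a 32-dimensional feature vector; a trained model would learn the scan parameters jointly with the CNN front end rather than drawing them at random.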
