One deep music representation to rule them all? A comparative analysis of different representation learning strategies

Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also efficient manner, deep transfer learning has become a common approach. In this approach, it is possible to reuse the output of a pre-trained neural network as the basis for a new learning task. The underlying hypothesis is that if the initial and new learning tasks show commonalities and are applied to the same type of input data (e.g., music audio), the generated deep representation of the data is also informative for the new task. Since, however, most of the networks used to generate deep representations are trained using a single initial learning source, their representation is unlikely to be informative for all possible future tasks. In this paper, we present the results of our investigation of what are the most important factors to generate deep representations for the data and learning tasks in the music domain. We conducted this investigation via an extensive empirical study that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations. We then validate these representations considering multiple target datasets for evaluation. The results of our experiments yield several insights into how to approach the design of methods for learning widely deployable deep data representations in the music domain.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Karol J. Piczak Environmental sound classification with convolutional neural networks , 2015, 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP).

[3]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[4]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[5]  Geoffroy Peeters,et al.  Scale and shift invariant time/frequency representation using auditory statistics: Application to rhythm description , 2016, 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP).

[6]  Matthew E. P. Davies,et al.  Transfer Learning In Mir: Sharing Learned Latent Representations For Music Audio Classification And Similarity , 2013, ISMIR.

[7]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[8]  Yi-Hsuan Yang,et al.  Generating Music Medleys via Playing Music Puzzle Games , 2017, AAAI.

[9]  Xavier Bresson,et al.  FMA: A Dataset for Music Analysis , 2016, ISMIR.

[10]  Yongdong Zhang,et al.  Multi-task deep visual-semantic embedding for video thumbnail selection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Karen Simonyan,et al.  Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders , 2017, ICML.

[12]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[13]  Tom Mitchell,et al.  Learning to Tag using Noisy Labels , 2010 .

[14]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[15]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[17]  Benjamin Schrauwen,et al.  Deep content-based music recommendation , 2013, NIPS.

[18]  J. Russell,et al.  The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology , 2005, Development and Psychopathology.

[19]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[20]  Benjamin Schrauwen,et al.  End-to-end learning for music audio , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[22]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[23]  Kyogu Lee,et al.  Learning Temporal Features Using a Deep Neural Network and its Application to Music Genre Classification , 2016, ISMIR.

[24]  Colin Raffel,et al.  librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[25]  Jieping Ye,et al.  Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis , 2015, IEEE Transactions on Big Data.

[26]  Yoshua Bengio,et al.  Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription , 2012, ICML.

[27]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[28]  Arthur Flexer,et al.  Basic Filters for Convolutional Neural Networks: Training or Design? , 2017, ArXiv.

[29]  Jordi Janer,et al.  A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals , 2012, ISMIR.

[30]  Xavier Serra,et al.  Experimenting with musically motivated convolutional neural networks , 2016, 2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI).

[31]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[32]  Mark B. Sandler,et al.  Automatic Tagging Using Deep Convolutional Neural Networks , 2016, ISMIR.

[33]  Arthur Flexer,et al.  Basic filters for convolutional neural networks applied to music: Training or design? , 2017, Neural Computing and Applications.

[34]  Chun-Liang Li,et al.  One Network to Solve Them All — Solving Linear Inverse Problems Using Deep Projection Models , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Geoffrey E. Hinton,et al.  Learning a better representation of speech soundwaves using restricted boltzmann machines , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  George Tzanetakis,et al.  An experimental comparison of audio tempo induction algorithms , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[37]  Bartlomiej Stasiak,et al.  Analysis of time-frequency representations for musical onset detection with convolutional neural network , 2016, 2016 Federated Conference on Computer Science and Information Systems (FedCSIS).

[38]  Thomas Grill,et al.  Boundary Detection in Music Structure Analysis using Convolutional Neural Networks , 2014, ISMIR.

[39]  Daniel P. W. Ellis,et al.  Content-Aware Collaborative Music Recommendation Using Pre-trained Neural Networks , 2015, ISMIR.

[40]  Yichuan Zhang,et al.  Advances in Neural Information Processing Systems 25 , 2012 .

[41]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Huy Phan,et al.  Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks , 2016, INTERSPEECH.

[43]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[44]  Antoni B. Chan,et al.  Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[45]  Marc Leman,et al.  Content-Based Music Information Retrieval: Current Directions and Future Challenges , 2008, Proceedings of the IEEE.

[46]  Yi-Hsuan Yang,et al.  1000 songs for emotional analysis of music , 2013, CrowdMM '13.

[47]  Emilia Gómez,et al.  Monoaural Audio Source Separation Using Deep Convolutional Neural Networks , 2017, LVA/ICA.

[48]  Mark Sandler,et al.  Transfer Learning for Music Classification and Regression Tasks , 2017, ISMIR.

[49]  Sebastian Böck,et al.  Improved musical onset detection with Convolutional Neural Networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[50]  Thierry Bertin-Mahieux,et al.  The Million Song Dataset , 2011, ISMIR.

[51]  Mark D. Plumbley,et al.  Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network , 2015, LVA/ICA.

[52]  Geoffrey E. Hinton Connectionist Learning Procedures , 1989, Artif. Intell..

[53]  Rahul Sukthankar,et al.  MatchNet: Unifying feature and metric learning for patch-based matching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[55]  Paul Lamere,et al.  Social Tagging and Music Information Retrieval , 2008 .

[56]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Benjamin Schrauwen,et al.  Transfer Learning by Supervised Pre-training for Audio-based Music Classification , 2014, ISMIR.

[58]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[60]  Lukasz Kaiser,et al.  One Model To Learn Them All , 2017, ArXiv.

[61]  Yoshua Bengio,et al.  Blocks and Fuel: Frameworks for deep learning , 2015, ArXiv.

[62]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[63]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[64]  Tom M. Mitchell,et al.  Learning to Tag from Open Vocabulary Labels , 2010, ECML/PKDD.

[65]  Joseph Hilbe,et al.  Data Analysis Using Regression and Multilevel/Hierarchical Models , 2009 .

[66]  Yi-Hsuan Yang,et al.  Similarity Embedding Network for Unsupervised Sequential Pattern Learning by Playing Music Puzzle Games , 2017, ArXiv.

[67]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[68]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[69]  Tetsuya Takiguchi,et al.  Local-feature-map Integration Using Convolutional Neural Networks for Music Genre Classification , 2012, INTERSPEECH.

[70]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[71]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[72]  Jason Weston,et al.  Multi-Tasking with Joint Semantic Spaces for Large-Scale Music Annotation and Retrieval , 2011 .

[73]  Martha Larson,et al.  Mining mood-specific movie similarity with matrix factorization for context-aware recommendation , 2010 .

[74]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[75]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[76]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[77]  Douglas Eck,et al.  Learning Features from Music Audio with Deep Belief Networks , 2010, ISMIR.

[78]  Juhan Nam,et al.  Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms , 2017, ArXiv.

[79]  Joachim Bingel,et al.  Identifying beneficial task relations for multi-task learning in deep neural networks , 2017, EACL.

[80]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[81]  Hui Zhang,et al.  Convolutional neural network for robust pitch determination , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[82]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[83]  Jae-Hun Kim,et al.  Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[84]  Bob L. Sturm The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval , 2013, ArXiv.

[85]  Bob L. Sturm The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use , 2013, ArXiv.

[86]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[87]  Margaret J. Robertson,et al.  Design and Analysis of Experiments , 2006, Handbook of statistics.

[88]  Bob L. Sturm The “Horse” Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems , 2016, CIE.

[89]  Bob L. Sturm,et al.  Deep Learning and Music Adversaries , 2015, IEEE Transactions on Multimedia.

[90]  Juhan Nam,et al.  Learning Sparse Feature Representations for Music Annotation and Retrieval , 2012, ISMIR.

[91]  Peter Goos,et al.  Optimal Design of Experiments: A Case Study Approach , 2011 .

[92]  Òscar Celma,et al.  Music Recommendation and Discovery - The Long Tail, Long Fail, and Long Play in the Digital Music Space , 2010 .

[93]  Juan Pablo Bello,et al.  Rethinking Automatic Chord Recognition with Convolutional Neural Networks , 2012, 2012 11th International Conference on Machine Learning and Applications.

[94]  Benjamin Schrauwen,et al.  Audio-based Music Classification with a Pretrained Convolutional Network , 2011, ISMIR.

[95]  Kyunghyun Cho,et al.  A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging , 2017, 2018 26th European Signal Processing Conference (EUSIPCO).

[96]  Yoshua Bengio,et al.  High-dimensional sequence transduction , 2012, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[97]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[98]  Yifan Hu,et al.  Collaborative Filtering for Implicit Feedback Datasets , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[99]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[100]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[101]  Jan Schlüter,et al.  Learning to Pinpoint Singing Voice from Weakly Labeled Examples , 2016, ISMIR.