Learning Rhythm And Melody Features With Deep Belief Networks

Deep learning techniques provide powerful methods for learning deep, structured projections that connect multiple domains of data. However, fine-tuning such networks for supervised problems is challenging, so many current approaches rely heavily on pre-training, an unsupervised procedure applied to the input observations. In previous work, we investigated using magnitude spectra as the network observations and found reasonable improvements over standard acoustic representations. However, in inherently supervised problems such as music emotion recognition, there is no guarantee that the starting points for optimization are anywhere near optimal, since emotion is unlikely to be the most dominant aspect of the data. In this new work, we develop input representations based on harmonic/percussive source separation, designed to capture rhythm and melodic contour. These representations are beat-synchronous, providing an event-driven view of the audio and, potentially, the ability to learn emotion-informative representations from pre-training alone. To provide a large dataset for our pre-training experiments, we select a subset of 50,000 songs from the Million Song Dataset and use their 30-60 second preview clips from 7digital to compute our custom feature representations.

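As a rough illustration of the feature pipeline described above (harmonic/percussive separation followed by beat-synchronous aggregation), the sketch below uses the librosa library; the function name, parameter values, and median aggregation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): beat-synchronous
# harmonic/percussive features from a short preview clip.
import numpy as np
import librosa

def beat_sync_hpss_features(path, sr=22050, n_fft=2048, hop_length=512):
    # Load the 30-60 second preview audio.
    y, sr = librosa.load(path, sr=sr)

    # Median-filtering harmonic/percussive separation applied to the
    # STFT magnitude (Fitzgerald-style HPSS).
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    H, P = librosa.decompose.hpss(S)

    # Dynamic-programming beat tracker (Ellis-style).
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr,
                                                 hop_length=hop_length)

    # Aggregate frames between consecutive beats (median here, as an
    # assumed choice), yielding an event-driven, beat-synchronous grid.
    H_sync = librosa.util.sync(H, beat_frames, aggregate=np.median)
    P_sync = librosa.util.sync(P, beat_frames, aggregate=np.median)

    # Stack harmonic (melodic contour) and percussive (rhythm) channels
    # as the observation vector for unsupervised DBN pre-training.
    return np.vstack([H_sync, P_sync]), tempo
```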