Feature learning and deep architectures: new directions for music informatics

Among those working to advance the state of the art in content-based music informatics, there is a general sense that progress is decelerating throughout the field. On closer inspection, performance trajectories across several applications reveal that this is indeed the case, raising some difficult questions for the discipline: why are we slowing down, and what can we do about it? Here, we strive to address both of these questions. First, we critically review the standard approach to music signal analysis and identify three specific deficiencies of current methods: hand-crafted feature design is sub-optimal and unsustainable, the power of shallow architectures is fundamentally limited, and short-time analysis cannot encode musically meaningful structure. Acknowledging breakthroughs in other perceptual AI domains, we argue that deep learning holds the potential to overcome each of these obstacles. Through conceptual arguments for feature learning and deeper processing architectures, we demonstrate how deep processing models are more powerful extensions of current methods, and why now is the time for this paradigm shift. We conclude with a discussion of current challenges and potential impact, to further motivate exploration of this promising research area.
