Learning Features of Music from Scratch

This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation for machine learning methods in music research. MusicNet consists of hundreds of freely licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument and note annotations, yielding over 1 million temporal labels across 34 hours of chamber music performances recorded under various studio and microphone conditions. The paper defines a multi-label classification task for predicting the notes sounding in a musical recording, along with an evaluation protocol, and benchmarks several machine learning architectures on this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency-selective filters as a low-level representation of audio.
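The task described above can be sketched minimally: each training example is a short audio window, the label is a binary vector over the MIDI note vocabulary (one slot per note sounding at the window's center), and the spectrogram baseline feeds a magnitude spectrum into a per-note classifier. The sketch below is illustrative only, not the paper's implementation; the window size, sample rate, and the synthetic 440 Hz example are assumptions, and the untrained linear model stands in for the learned architectures.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper): note prediction framed as
# multi-label classification over short audio windows.
rng = np.random.default_rng(0)

FS = 44100    # sample rate in Hz (assumed)
WINDOW = 2048 # samples per input window (assumed)
N_NOTES = 128 # MIDI note vocabulary

def spectrogram_features(window):
    """Magnitude spectrum of one Hann-windowed audio frame
    (the 'spectrogram features' baseline, in miniature)."""
    return np.abs(np.fft.rfft(window * np.hanning(len(window))))

# Synthetic stand-in for one labeled window: a 440 Hz tone, i.e. MIDI note 69.
t = np.arange(WINDOW) / FS
audio = np.sin(2 * np.pi * 440.0 * t)
label = np.zeros(N_NOTES)
label[69] = 1.0

x = spectrogram_features(audio)  # shape (WINDOW // 2 + 1,)

# A linear multi-label model: an independent logistic output per note.
# W is randomly initialized here; in practice it would be trained on
# (window, label) pairs from the dataset.
W = rng.normal(scale=0.01, size=(N_NOTES, x.size))
probs = 1.0 / (1.0 + np.exp(-(W @ x)))  # per-note probabilities

print(x.shape, probs.shape)
```

An end-to-end model, by contrast, would replace `spectrogram_features` with learned weights applied directly to the raw window; the paper's observation is that such weights converge to frequency-selective filters resembling the hand-designed spectrogram front end.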
