Towards end-to-end polyphonic music transcription: Transforming music audio directly to a score

We present a neural network model that learns to produce music scores directly from audio signals. Rather than building in commonplace processing steps, such as frequency transform front-ends, harmonicity and scale priors, or temporal pitch smoothing, we show that a neural network can learn such steps on its own when presented with appropriate training data. We show that such a network can perform monophonic transcription with very high accuracy, and that it also generalizes well to transcribing polyphonic music.
