An End-to-end Framework for Audio-to-Score Music Transcription on Monophonic Excerpts

In this work, we present an end-to-end framework for audio-to-score transcription. To the best of our knowledge, this is the first automatic music transcription approach that directly obtains a symbolic score from audio, rather than performing separate stages for piano-roll estimation (pitch detection and note tracking), meter detection, or key estimation. The proposed method is based on a Convolutional Recurrent Neural Network (CRNN) architecture trained directly on pairs of spectrograms and their corresponding symbolic scores in Western notation. Unlike standard pitch-estimation methods, the proposed architecture does not require the music symbols to be aligned with their audio frames, thanks to a Connectionist Temporal Classification (CTC) loss function. Training and evaluation were performed on a large dataset of short monophonic scores (incipits) from the RISM collection, which were synthesized to obtain the ground-truth audio. Although there is still room for improvement, most musical symbols were detected correctly, and the evaluation results validate the proposed approach. We believe this end-to-end framework opens new avenues for automatic music transcription.
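To make the training setup concrete, the following is a minimal sketch (not the authors' implementation) of a CRNN trained with a CTC loss in PyTorch: a convolutional front end reads a spectrogram, a bidirectional recurrent layer models the sequence over time, and the CTC loss matches the per-frame outputs against an unaligned symbol sequence. All layer sizes, the 100-symbol vocabulary, and the input dimensions are illustrative assumptions.

```python
# Hedged sketch of a CRNN + CTC training step; shapes and sizes are assumptions.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_bins=256, n_symbols=100, rnn_hidden=128):
        super().__init__()
        # Convolutional front end: local time-frequency features.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16),
            nn.ReLU(), nn.MaxPool2d((2, 1)),  # pool frequency only, keep time
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
            nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        # Recurrent back end: models the symbol sequence across frames.
        self.rnn = nn.LSTM(32 * (n_bins // 4), rnn_hidden,
                           bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank label.
        self.fc = nn.Linear(2 * rnn_hidden, n_symbols + 1)

    def forward(self, spec):                  # spec: (batch, 1, n_bins, n_frames)
        x = self.conv(spec)                   # (batch, 32, n_bins // 4, n_frames)
        x = x.permute(0, 3, 1, 2).flatten(2)  # (batch, n_frames, features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(-1)     # per-frame symbol log-probabilities

model = CRNN()
spec = torch.randn(4, 1, 256, 200)            # batch of 4 synthetic spectrograms
targets = torch.randint(1, 101, (4, 20))      # unaligned target symbol sequences
log_probs = model(spec).permute(1, 0, 2)      # CTCLoss expects (time, batch, classes)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((4,), 200),
    target_lengths=torch.full((4,), 20),
)
loss.backward()
```

Note that the CTC loss only needs the lengths of the input and target sequences, not a frame-level alignment; this is what lets the network be trained on spectrogram/score pairs without annotating where each symbol occurs in the audio.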
