Blending Acoustic and Language Model Predictions for Automatic Music Transcription

In this paper, we introduce a method for converting an input probabilistic piano roll (the output of a typical multi-pitch detection model) into a binary piano roll. This task is an important step for many automatic music transcription systems, whose goal is to convert an audio recording into some symbolic format. Our model has two components: an LSTM-based music language model (MLM), which can be trained on any MIDI data, not just data aligned with audio; and a blending model, which combines the probabilities of the MLM with those of the input probabilistic piano roll given by an acoustic multi-pitch detection model and must be trained on a comparably small amount of aligned data. We use scheduled sampling to make the MLM robust to noisy sequences at test time. We analyze the performance of our model on the MAPS dataset using two different timesteps (40ms and 16th-note), comparing it against a strong baseline hidden Markov model trained with a method that, to our knowledge, has not previously been used for this task. We report a statistically significant improvement over HMM decoding in terms of notewise F-measure with both timesteps, with 16th-note timesteps yielding a further improvement over 40ms timesteps.
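The core post-processing idea (combining framewise acoustic posteriors with language-model predictions, then thresholding to a binary piano roll) can be sketched as follows. This is a minimal illustration, not the paper's actual blending model: the function names, the geometric-mean blend, and the fixed weight `alpha` are all assumptions for illustration; in the paper the blending is learned from aligned data.

```python
import numpy as np

def blend(p_acoustic, p_lm, alpha=0.5):
    """Blend acoustic and language-model probabilities per pitch/frame.

    Uses a weighted geometric mean (product-of-experts style) as a stand-in
    for a learned blending model. alpha weights the acoustic posterior.
    """
    eps = 1e-9  # avoid log(0) for zero-probability entries
    log_blend = alpha * np.log(p_acoustic + eps) + (1.0 - alpha) * np.log(p_lm + eps)
    return np.exp(log_blend)

def binarize(p, threshold=0.5):
    """Threshold a probabilistic piano roll into a binary piano roll."""
    return (p >= threshold).astype(np.int8)

# Toy example: 2 pitches x 3 frames of acoustic and MLM probabilities.
p_acoustic = np.array([[0.9, 0.8, 0.2],
                       [0.1, 0.4, 0.7]])
p_lm = np.array([[0.8, 0.9, 0.1],
                 [0.2, 0.6, 0.9]])
piano_roll = binarize(blend(p_acoustic, p_lm))
```

In the paper, decoding proceeds frame by frame: the MLM's prediction for the next frame is conditioned on the (binarized) frames already decoded, which is why scheduled sampling during training is needed to keep the MLM robust to its own noisy outputs.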
