An Augmented Lagrangian Method for Piano Transcription Using Equal Loudness Thresholding and LSTM-Based Decoding

A central goal in automatic music transcription is to detect individual note events in music recordings. An important variant is instrument-dependent transcription, in which methods may use calibration data for the instruments in use. Despite this additional information, however, results rarely exceed an F-measure of 80%. One potential explanation is that the transcription problem can be shown to be ill-conditioned and therefore relies on appropriate regularization. A recently proposed method employs a mixture of simple, convex regularizers (to stabilize parameter estimation) and more complex terms (to encourage musically meaningful structure). In this paper, we present two extensions to this method. First, we integrate a computational loudness model to better distinguish real from spurious note detections. Second, we employ (bidirectional) Long Short-Term Memory networks to re-weight the likelihood of detected note constellations. Despite their simplicity, these two extensions reduce the note error rate by about 35% relative to the state of the art.
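The abstract describes the two extensions only at a high level. As a rough illustration of the first idea, loudness-based filtering of note candidates, the sketch below discards detections whose estimated loudness falls below a pitch-dependent threshold. The function name, the data layout, and the threshold values are illustrative assumptions for this sketch, not the paper's actual computational loudness model.

```python
def loudness_threshold_filter(detections, thresholds_db):
    """Keep only note detections whose estimated loudness exceeds a
    pitch-dependent threshold (a hypothetical stand-in for a
    computational loudness model).

    detections: list of (midi_pitch, loudness_db) tuples
    thresholds_db: dict mapping midi_pitch -> minimum loudness in dB
    """
    kept = []
    for pitch, loudness in detections:
        # A candidate is treated as a real note only if it is loud
        # enough for its pitch; quieter candidates are considered
        # spurious and dropped.
        if loudness >= thresholds_db.get(pitch, float("-inf")):
            kept.append((pitch, loudness))
    return kept


# Example: two candidates at MIDI pitch 60 and one at 72; only the
# first is loud enough to pass its pitch's threshold.
detections = [(60, -20.0), (60, -45.0), (72, -30.0)]
thresholds = {60: -30.0, 72: -25.0}
print(loudness_threshold_filter(detections, thresholds))
```

In a full system, `thresholds_db` would be derived from a perceptual loudness model rather than fixed per-pitch constants; the filtering step itself stays the same.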
