Context-Dependent Piano Music Transcription With Convolutional Sparse Coding

This paper presents a novel approach to automatic transcription of piano music in a context-dependent setting. This approach employs convolutional sparse coding to approximate the music waveform as the summation of piano note waveforms (dictionary elements) convolved with their temporal activations (onset transcription). The piano note waveforms are pre-recorded for the specific piano to be transcribed in the specific environment. During transcription, the note waveforms are fixed and their temporal activations are estimated and post-processed to obtain the pitch and onset transcription. This approach works in the time domain, models temporal evolution of piano notes, and estimates pitches and onsets simultaneously in the same framework. Experiments show that it significantly outperforms a state-of-the-art music transcription method trained in the same context-dependent setting, in both transcription accuracy and time precision, in various scenarios including synthetic, anechoic, noisy, and reverberant environments.
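The model described above can be illustrated with a minimal sketch, under several loudly stated assumptions. The two "note waveforms" are toy decaying sinusoids rather than recorded piano notes, the signal is 400 samples long, and the activations are estimated with plain ISTA (iterative soft-thresholding), which is swapped in here for brevity and is not claimed to be the solver used in the paper. The names `synthesize`, `csc_ista`, `toy_note`, and the parameters `lam` and `n_iter` are hypothetical and chosen only for this example. The note waveforms stay fixed, only the activations are estimated, and the largest activation per note is read off as its onset.

```python
import numpy as np

def synthesize(dictionary, activations):
    """Forward model: sum of each note waveform convolved with its activation,
    cropped to the signal length."""
    n = activations.shape[1]
    return sum(np.convolve(x, d)[:n] for d, x in zip(dictionary, activations))

def csc_ista(signal, dictionary, lam=0.05, n_iter=300):
    """Estimate sparse activations for a fixed dictionary with plain ISTA
    (an illustrative stand-in, not the paper's algorithm)."""
    n = len(signal)
    x = np.zeros((len(dictionary), n))
    # Conservative upper bound on the Lipschitz constant of the gradient:
    # the spectral norm of each convolution is at most the l1 norm of its kernel.
    lipschitz = sum(np.abs(d).sum() ** 2 for d in dictionary)
    for _ in range(n_iter):
        residual = synthesize(dictionary, x) - signal
        for m, d in enumerate(dictionary):
            # Adjoint of the cropped convolution: correlate the residual with d.
            grad = np.convolve(residual, d[::-1])[len(d) - 1:]
            z = x[m] - grad / lipschitz
            # Soft-thresholding enforces sparsity of the activations.
            x[m] = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    return x

def toy_note(freq, length=100, decay=30.0):
    """Hypothetical stand-in for a recorded note: unit-norm decaying sinusoid."""
    t = np.arange(length)
    w = np.exp(-t / decay) * np.sin(2 * np.pi * freq * t)
    return w / np.linalg.norm(w)

# Fixed dictionary of two toy notes and a ground-truth activation pattern.
dictionary = [toy_note(0.05), toy_note(0.13)]
true_x = np.zeros((2, 400))
true_x[0, 50] = 1.0   # note 0 starts at sample 50
true_x[1, 120] = 0.8  # note 1 starts at sample 120, overlapping note 0

signal = synthesize(dictionary, true_x)
estimated = csc_ista(signal, dictionary)

# Crude post-processing: take the strongest activation per note as its onset.
onsets = [int(np.argmax(np.abs(row))) for row in estimated]
print(onsets)
```

Because the activations live on the same sample grid as the waveform, the onset estimates in this sketch are resolved at the sample level, which is one way to read the abstract's point about working in the time domain. A practical system would need a more careful peak-picking step and a faster solver than this toy loop.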
