Piano Transcription in the Studio Using an Extensible Alternating Directions Framework

Given a musical audio recording, the goal of automatic music transcription is to determine a score-like representation of the piece underlying the recording. Despite significant interest within the research community, several studies have reported a “glass ceiling” effect: an apparent limit on transcription accuracy that current methods seem incapable of overcoming. In this paper, we explore how much this effect can be mitigated by focusing on a specific instrument class and making use of additional information about the recording conditions that is available in studio or home recording scenarios. In particular, exploiting the availability of single-note recordings for the instrument in use, we develop a novel signal model employing variable-length spectro-temporal patterns as its central building blocks, tailored for pitched percussive instruments such as the piano. Temporal dependencies between spectral templates are modeled, resembling characteristics of factorial scaled hidden Markov models (FS-HMMs) and other methods combining nonnegative matrix factorization with Markov processes. In contrast to FS-HMMs, our parameter estimation is formulated in a global, relaxed form within the extensible alternating direction method of multipliers (ADMM) framework, which enables the systematic combination of basic regularizers promoting sparsity and local stationarity in the note activity with more complex regularizers imposing temporal semantics. The proposed method achieves an F-measure of 93-95% for note onsets on pieces recorded on a Yamaha Disklavier (MAPS database).
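
To make the ADMM-based combination of regularizers concrete, the following is a minimal sketch (not the authors' implementation) of consensus ADMM applied to a simplified activation-estimation problem: a least-squares data-fit term V ≈ WH is combined with an l1 penalty promoting sparsity and a quadratic temporal-smoothness penalty promoting local stationarity, each handled through its own proximal step. The dictionary W (e.g., spectral templates derived from single-note recordings), the function name, and all parameter values are placeholders; the paper's variable-length spectro-temporal patterns and temporal-semantics regularizers are not reproduced here.

```python
import numpy as np

def transcribe_admm(V, W, lam=0.1, mu=1.0, rho=1.0, n_iter=200):
    """Estimate nonnegative activations H from V ~ W @ H via consensus ADMM (sketch)."""
    K, T = W.shape[1], V.shape[1]

    # First-difference operator along time, used by the smoothness prox.
    D = np.diff(np.eye(T), axis=1)                 # shape (T, T-1)
    M_smooth = mu * (D @ D.T) + rho * np.eye(T)    # symmetric positive definite

    # Linear system for the data-fit prox.
    M_fit = W.T @ W + rho * np.eye(K)

    # One local copy per objective term, a consensus variable, and scaled duals.
    X = [np.zeros((K, T)) for _ in range(3)]
    U = [np.zeros((K, T)) for _ in range(3)]
    Z = np.zeros((K, T))

    for _ in range(n_iter):
        # Prox of the data-fit term 0.5 * ||V - W H||_F^2.
        A = Z - U[0]
        X[0] = np.linalg.solve(M_fit, W.T @ V + rho * A)

        # Prox of lam * ||H||_1 with nonnegativity: soft-threshold, then clip.
        A = Z - U[1]
        X[1] = np.maximum(A - lam / rho, 0.0)

        # Prox of (mu/2) * ||H D||_F^2 (temporal smoothness): a right-hand solve.
        A = Z - U[2]
        X[2] = np.linalg.solve(M_smooth, (rho * A).T).T

        # Consensus and dual updates.
        Z = sum(X[i] + U[i] for i in range(3)) / 3.0
        for i in range(3):
            U[i] += X[i] - Z

    return np.maximum(Z, 0.0)
```

Adding a further regularizer in this scheme amounts to appending another local variable with its own proximal operator, which is the extensibility the abstract refers to.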
