Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (∼0.1 ms to ∼100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (≈3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
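
The Wave2Midi2Wave factorization can be read as three models chained through the note-event representation: wave → notes (transcription), notes → notes (symbolic generation), and notes → wave (synthesis). The sketch below is a hypothetical illustration of that pipeline under stated assumptions, not the paper's actual API: the `NoteEvent` type and the `transcribe`, `compose`, `synthesize`, and `wave2midi2wave` functions are placeholders standing in for a trained Onsets-and-Frames-style transcription model, a Music-Transformer-style note-event language model, and a conditional WaveNet synthesizer.

```python
# Hypothetical sketch of the Wave2Midi2Wave pipeline described in the
# abstract. All function bodies are placeholders for trained models; the
# names, signatures, and NoteEvent type are illustrative assumptions.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class NoteEvent:
    pitch: int        # MIDI pitch (0-127)
    velocity: int     # MIDI velocity (0-127)
    onset: float      # onset time in seconds
    offset: float     # offset time in seconds


def transcribe(audio: np.ndarray, sample_rate: int) -> List[NoteEvent]:
    """Wave -> MIDI: an Onsets-and-Frames-style model maps raw audio to
    discrete note events with millisecond-scale timing."""
    raise NotImplementedError("placeholder for a trained transcription model")


def compose(prime: List[NoteEvent], num_events: int) -> List[NoteEvent]:
    """MIDI -> MIDI: a Music-Transformer-style language model over note
    events generates long-range (~100 s) musical structure symbolically."""
    raise NotImplementedError("placeholder for a trained note-event language model")


def synthesize(notes: List[NoteEvent], sample_rate: int = 16000) -> np.ndarray:
    """MIDI -> Wave: a conditional WaveNet renders note events back into a
    raw waveform, capturing fine (~0.1 ms) timbral detail."""
    raise NotImplementedError("placeholder for a trained synthesis model")


def wave2midi2wave(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Chain the three models: transcribe real audio into notes, compose a
    continuation in the symbolic domain, then synthesize new audio."""
    notes = transcribe(audio, sample_rate)
    continuation = compose(notes, num_events=1024)
    return synthesize(continuation, sample_rate)
```

Factorizing through notes is what makes each stage separately trainable: the fine (≈3 ms) alignment between note labels and audio in MAESTRO supplies strong supervision for both the wave → notes and notes → wave models, while the symbolic middle stage handles long-timescale structure.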
