AUTOMATIC MUSIC TRANSCRIPTION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS

Automatic Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). The challenge is to translate an audio sequence into a symbolic representation of music. Recently, convolutional neural networks (CNNs) have been applied successfully to the task by transcribing individual frames of audio [44, 46]. However, such models cannot, by their nature, capture temporal relations and long-range dependencies. Furthermore, obtaining annotations for supervised learning in this setting is extremely labor-intensive. We propose a model that overcomes these problems. The convolutional sequence-to-sequence (Cseq2seq) model applies a CNN to learn a low-dimensional representation of audio frames and a sequential model to translate these learned features directly into a symbolic representation. Our approach has three advantages over other methods: (i) the audio frame representation and the sequential model are trained jointly, end-to-end; (ii) the recurrent model can capture temporal features of musical pieces and thereby improve transcription; and (iii) our model learns from entire sequences rather than from temporally aligned onset and offset annotations for each note, making it possible to train on large, already existing corpora of music. To test our method we created our own dataset of 17K monophonic songs with their corresponding MusicXML files. Initial experiments prove the validity of our approach.
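The abstract specifies the architecture only at a high level: a CNN that maps each audio frame to a low-dimensional feature vector, followed by a recurrent sequence-to-sequence model that emits symbolic tokens. A minimal sketch of that idea, assuming PyTorch and with all layer sizes, the token vocabulary, and names such as `Cseq2seq` and `n_tokens` chosen for illustration rather than taken from the paper, might look as follows:

```python
import torch
import torch.nn as nn

class Cseq2seq(nn.Module):
    """Illustrative sketch, not the authors' code: a CNN encodes spectrogram
    frames into low-dimensional vectors, and a GRU encoder-decoder translates
    that feature sequence into symbolic tokens (e.g. note events)."""

    def __init__(self, n_bins=229, n_tokens=130, feat_dim=64, hidden=256):
        super().__init__()
        # CNN over the frequency axis: learns a low-dimensional
        # representation of each audio frame (assumed layer sizes).
        self.frame_cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # -> (batch*time, 32, 1)
        )
        self.proj = nn.Linear(32, feat_dim)
        # Sequential model over the learned frame features.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(n_tokens, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, spec, targets):
        # spec:    (batch, time, n_bins) spectrogram frames
        # targets: (batch, out_len) symbolic tokens, teacher-forced
        b, t, f = spec.shape
        x = self.frame_cnn(spec.reshape(b * t, 1, f)).squeeze(-1)  # (b*t, 32)
        x = self.proj(x).reshape(b, t, -1)                         # (b, t, feat_dim)
        _, h = self.encoder(x)          # final hidden state summarizes the audio
        y, _ = self.decoder(self.embed(targets), h)
        return self.out(y)              # (batch, out_len, n_tokens) logits
```

Training such a sketch would minimize token-level cross-entropy between the predicted logits and target token sequences derived from the MusicXML files, which matches advantage (iii) above: only whole-sequence symbolic annotations are needed, not temporally aligned per-note onsets and offsets.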

[1] Mert Bay et al., Second Fiddle is Important Too: Pitch Tracking Individual Voices in Polyphonic Music, 2012, ISMIR.

[2] Mark D. Plumbley et al., Structured sparsity for automatic music transcription, 2012, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Simon J. Godsill et al., Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters, 1999, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[4] Fabrizio Argenti et al., Automatic Transcription of Polyphonic Music Based on the Constant-Q Bispectral Analysis, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[5] Guillaume Lemaitre et al., Real-time Polyphonic Music Transcription with Non-negative Matrix Factorization and Beta-divergence, 2010, ISMIR.

[6] Benjamin Schrauwen et al., Deep content-based music recommendation, 2013, NIPS.

[7] Markus Schedl et al., Polyphonic piano note transcription with recurrent neural networks, 2012, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Daniel P. W. Ellis et al., A Discriminative Model for Polyphonic Piano Transcription, 2007, EURASIP Journal on Advances in Signal Processing.

[9] Roland Badeau et al., Harmonic Adaptive Latent Component Analysis of Audio and Application to Music Transcription, 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[10] Simon J. Godsill et al., Multiple Pitch Estimation Using Non-Homogeneous Poisson Processes, 2011, IEEE Journal of Selected Topics in Signal Processing.

[11] Clément Farabet et al., Torch7: A Matlab-like Environment for Machine Learning, 2011, NIPS.

[12] Heiga Zen et al., WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[13] Thomas Grill et al., Boundary Detection in Music Structure Analysis using Convolutional Neural Networks, 2014, ISMIR.

[14] Tillman Weyde et al., Template Adaptation for Improving Automatic Music Transcription, 2014, ISMIR.

[15] Roland Badeau et al., Multipitch Estimation of Piano Sounds Using a New Probabilistic Spectral Smoothness Principle, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[16] Changshui Zhang et al., Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[17] Xavier Serra et al., Essentia: An Audio Analysis Library for Music Information Retrieval, 2013, ISMIR.

[18] Sebastian Böck et al., Improved musical onset detection with Convolutional Neural Networks, 2014, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Roland Badeau et al., Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches, 2008, 16th European Signal Processing Conference (EUSIPCO).

[20] Matija Marolt et al., A connectionist approach to automatic transcription of polyphonic piano music, 2004, IEEE Transactions on Multimedia.

[21] Lale Akarun et al., Large scale polyphonic music transcription using randomized matrix decompositions, 2012, 20th European Signal Processing Conference (EUSIPCO).

[22] David Barber et al., A generative model for music transcription, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[23] Axel Röbel et al., Multiple Fundamental Frequency Estimation and Polyphony Inference of Polyphonic Music Signals, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[24] Brendt Wohlberg et al., Piano music transcription with fast convolutional sparse coding, 2015, IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP).

[25] Colin Raffel et al., librosa: Audio and Music Signal Analysis in Python, 2015, SciPy.

[26] Zhiyao Duan et al., Piano music transcription modeling note temporal evolution, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Daniel P. W. Ellis et al., Transcribing Multi-Instrument Polyphonic Music With Hierarchical Eigeninstruments, 2011, IEEE Journal of Selected Topics in Signal Processing.

[28] Simon Dixon et al., On the Computer Recognition of Solo Piano Music, 2000.

[29] Benjamin Schrauwen et al., End-to-end learning for music audio, 2014, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] José Manuel Iñesta Quereda et al., Multiple fundamental frequency estimation using Gaussian smoothness, 2008, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31] Masataka Goto et al., A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals, 2004, Speech Communication.

[32] Ray Meddis et al., Virtual pitch and phase sensitivity of a computer model of the auditory periphery, 1991.

[33] P. Smaragdis et al., Non-negative matrix factorization for polyphonic music transcription, 2003, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[34] Anssi Klapuri et al., Signal Processing Methods for Music Transcription, 2006.

[35] Benjamin Schrauwen et al., Audio-based Music Classification with a Pretrained Convolutional Network, 2011, ISMIR.

[36] Mark D. Plumbley et al., A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments, 2015, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37] Kenta Oono et al., Chainer: A Next-Generation Open Source Framework for Deep Learning, 2015.

[38] Mark D. Plumbley et al., Unsupervised analysis of polyphonic music by sparse coding, 2006, IEEE Transactions on Neural Networks.

[39] Zaïd Harchaoui et al., Learning Features of Music from Scratch, 2016, ICLR.

[40] Simon Dixon et al., An End-to-End Neural Network for Polyphonic Piano Music Transcription, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[41] Simon Dixon et al., A Shift-Invariant Latent Variable Model for Automatic Music Transcription, 2012, Computer Music Journal.

[42] Masataka Goto et al., A Nonparametric Bayesian Multipitch Analyzer Based on Infinite Latent Harmonic Allocation, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[43] Rob Fergus et al., Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[44] Emmanuel Vincent et al., Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription, 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[45] Björn Schuller et al., Automatic Transcription of Recorded Music, 2012.

[46] Simon J. Godsill et al., Bayesian harmonic models for musical signal analysis, 2003.

[47] Anssi Klapuri et al., Multiple fundamental frequency estimation based on harmonicity and spectral smoothness, 2003, IEEE Transactions on Speech and Audio Processing.

[48] Matti Karjalainen et al., A computationally efficient multipitch analysis model, 2000, IEEE Transactions on Speech and Audio Processing.

[49] Hirokazu Kameoka et al., A Multipitch Analyzer Based on Harmonic Temporal Structured Clustering, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[50] Yoshua Bengio et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[51] Anssi Klapuri et al., Automatic Transcription of Melody, Bass Line, and Chords in Polyphonic Music, 2008, Computer Music Journal.

[52] Juhan Nam et al., A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations, 2011, ISMIR.

[53] Bhavik R. Bakshi et al., Wave-net: a multiresolution, hierarchical neural network with localized learning, 1993.