Automatic Drum Transcription for Polyphonic Recordings Using Soft Attention Mechanisms and Convolutional Neural Networks

Automatic drum transcription is the process of generating symbolic notation for percussion instruments within audio recordings. To date, recurrent neural network (RNN) systems have achieved the highest evaluation accuracies for both drum solo and polyphonic recordings, however the accuracies within a polyphonic context still remain relatively low. To improve accuracy for polyphonic recordings, we present two approaches to the ADT problem: First, to capture the dynamism of features in multiple time-step hidden layers, we propose the use of soft attention mechanisms (SA) and an alternative RNN configuration containing additional peripheral connections (PC). Second, to capture these same trends at the input level, we propose the use of a convolutional neural network (CNN), which uses a larger set of time-step features. In addition, we propose the use of a bidirectional recurrent neural network (BRNN) in the peak-picking stage. The proposed systems are evaluated along with two state-of-the-art ADT systems in five evaluation scenarios, including a newly-proposed evaluation methodology designed to assess the generalisability of ADT systems. The results indicate that all of the newly proposed systems achieve higher accuracies than the stateof- the-art RNN systems for polyphonic recordings and that the additional BRNN peak-picking stage offers slight improvement in certain contexts.

[1]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[2]  Jouni Paulus,et al.  Signal Processing Methods for Drum Transcription and Music Structure Analysis , 2010 .

[3]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[4]  Matthew E. P. Davies,et al.  An open-source drum transcription system for Pure Data and Max MSP , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Simon Dixon,et al.  Drum Transcription via Classification of Bar-Level Rhythmic Patterns , 2014, ISMIR.

[6]  Gerhard Widmer Enhanced peak picking for onset detection with recurrent neural networks , 2013 .

[7]  Masataka Goto,et al.  Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With Harmonic Structure Suppression , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Axel Röbel,et al.  On automatic drum transcription using non-negative matrix deconvolution and itakura saito divergence , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Sebastian Böck,et al.  Improved musical onset detection with Convolutional Neural Networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Peter Knees,et al.  Drum transcription from polyphonic music with recurrent neural networks , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[15]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[16]  Jeremy Irvin,et al.  Recurrent Neural Networks with Attention for Genre Classification , 2016 .

[17]  Jason Hockman,et al.  Automatic Drum Transcription Using Bi-Directional Recurrent Neural Networks , 2016, ISMIR.

[18]  Alexander Lerch,et al.  Drum transcription using partially fixed non-negative matrix factorization , 2015, 2015 23rd European Signal Processing Conference (EUSIPCO).

[19]  Gaël Richard,et al.  Transcription and Separation of Drum Signals From Polyphonic Music , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Daniel Gärtner,et al.  Real-Time Transcription and Separation of Drum Recordings Based on NMF Decomposition , 2014, DAFx.

[21]  Gaël Richard,et al.  Robust Downbeat Tracking Using an Ensemble of Convolutional Networks , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[22]  Peter Knees,et al.  Recurrent Neural Networks for Drum Transcription , 2016, ISMIR.