A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network

We present a single-step musical tempo estimation system based solely on a convolutional neural network (CNN). In contrast to existing systems, which typically first identify onsets or beats and then derive a tempo, our system estimates the tempo directly from a conventional mel spectrogram in a single step. This is achieved by framing tempo estimation as a multi-class classification problem and using a network architecture inspired by conventional approaches. The system's CNN was trained on the union of three datasets covering a large variety of genres and tempi, using problem-specific data augmentation techniques. Two of the three ground truths are novel and will be released for research purposes. As input, the system requires only 11.9 s of audio and is therefore suitable for local as well as global tempo estimation. When used as a global estimator, it performs as well as or better than other state-of-the-art algorithms. In particular, exact tempo estimation without tempo-octave confusion is significantly improved. As a local estimator, it can be used to identify and visualize tempo drift in musical performances.
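
The framing described above, a log-mel spectrogram of roughly 11.9 s of audio fed to a CNN whose softmax output treats tempo estimation as multi-class classification, can be sketched as follows. This is a minimal illustration, assuming librosa for the spectrogram and Keras for the classifier; the sample rate, mel-band count, number of tempo classes, and the layer stack itself are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a single-step tempo classifier: mel spectrogram in,
# softmax over tempo classes out. Parameter choices below (sample rate,
# mel bands, class count) are assumptions for illustration only.
import librosa
import numpy as np
import tensorflow as tf

SR = 11025           # assumed sample rate
CLIP_SECONDS = 11.9  # input length quoted in the abstract
N_MELS = 40          # assumed number of mel bands
N_CLASSES = 256      # assumed tempo classes, e.g. one class per BPM value

def mel_input(path):
    """Load ~11.9 s of audio and convert it to a log-mel spectrogram."""
    y, sr = librosa.load(path, sr=SR, duration=CLIP_SECONDS, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # shape: (mel bands, time frames, 1 channel) for a 2-D CNN
    return mel_db[..., np.newaxis]

def build_model(input_shape):
    """A small CNN mapping the spectrogram directly to tempo classes.
    The published architecture is more elaborate; this stack only
    illustrates the single-step classification framing."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='elu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='elu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='elu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(N_CLASSES, activation='softmax'),
    ])

if __name__ == '__main__':
    x = mel_input('example.wav')               # hypothetical audio file
    model = build_model(x.shape)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    probs = model.predict(x[np.newaxis, ...])  # untrained; shapes only
    print('most likely tempo class:', int(np.argmax(probs[0])))
```

In a training setting, each clip would carry an integer tempo-class label and be passed through the data augmentation the abstract mentions; the call above runs an untrained model and only demonstrates the input and output shapes of the single-step formulation.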
