On the Futility of Learning Complex Frame-Level Language Models for Chord Recognition

Chord recognition systems use temporal models to post-process frame-wise chord predictions from acoustic models. Traditionally, first-order models such as Hidden Markov Models have been used for this task, with recent work suggesting Recurrent Neural Networks instead. Because of their ability to learn longer-term dependencies, these models are expected to learn and apply musical knowledge, rather than merely smoothing the output of the acoustic model. In this paper, we argue that learning complex temporal models at the level of audio frames is futile in principle, and that non-Markovian models do not perform better than their first-order counterparts. We support our argument with three experiments on the McGill Billboard dataset. The first two show 1) that when learning complex temporal models at the frame level, improvements in chord sequence modelling are marginal; and 2) that these improvements do not translate when applied within a full chord recognition system. The third, still rather preliminary, experiment gives first indications that using complex sequential models for chord prediction at higher temporal levels might be more promising.