Large-vocabulary Chord Transcription Via Chord Structure Decomposition

While audio chord recognition systems have achieved considerable accuracy on small vocabularies (e.g., major/minor chords), the large-vocabulary chord recognition problem remains unsolved, which hinders the practical usage of such systems. The difficulty mainly lies in the intrinsic long-tail distribution of chord qualities: most chord qualities have too few samples for model training. In this paper, we propose a new model for audio chord recognition under a huge chord vocabulary. The core concept is to decompose any chord label into a set of musically meaningful components (e.g., triad, bass, seventh), each with a much smaller vocabulary than the overall chord vocabulary. A multitask classifier is then trained to recognize all the components given the audio features, and the labels of the individual components are reassembled to form the final chord label. Experiments show that the proposed system not only achieves state-of-the-art results on traditional evaluation metrics but also performs well on a large vocabulary.

Large-vocabulary chord transcription is a difficult task because the number of chord qualities is large and the distribution of training chord classes is extremely biased. For example, the Billboard dataset [2], a human-annotated dataset, contains 230 different chord qualities, or equivalently, 2,749 distinct chord classes.¹ While the top 10% of chord qualities cover 93.86% of the data, the bottom 50% of chord qualities together cover only 0.35%.² Such a long-tailed chord distribution makes it extremely hard to model rare chord qualities. To bypass the problem, previous systems typically adopt two kinds of strategies: chord quality simplification and

¹ We assume here that each chord quality can be combined with all 12 possible roots, except for the N (no-chord) label, giving 229 × 12 + 1 = 2,749 distinct classes.
² In this calculation, chord quality counts are weighted by their durations.

© Junyan Jiang, et al. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Junyan Jiang, et al., "Large-Vocabulary Chord Transcription via Chord Structure Decomposition," 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.
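The chord structure decomposition described in the abstract can be sketched in code. The following is a minimal illustration (not the authors' implementation): a chord label is split into components (here assumed to be root, triad, seventh, and bass), each drawn from a much smaller vocabulary, and the component labels can be losslessly reassembled into the full chord label. The label syntax loosely follows Harte-style text annotations [20]; the small quality tables below are illustrative assumptions covering only a few common qualities, not the paper's exact component inventory.

```python
# Illustrative component tables (assumed, not from the paper): each chord
# quality maps to a (triad, seventh) pair drawn from small vocabularies.
QUALITY_TO_COMPONENTS = {
    "maj":  ("maj", "none"),
    "min":  ("min", "none"),
    "maj7": ("maj", "maj7"),
    "min7": ("min", "min7"),
    "7":    ("maj", "b7"),   # dominant seventh: major triad + flat seventh
}
COMPONENTS_TO_QUALITY = {v: k for k, v in QUALITY_TO_COMPONENTS.items()}

def decompose(label):
    """Split a label such as 'C:maj7/3' into root/triad/seventh/bass parts."""
    if label == "N":  # the no-chord symbol has no components
        return {"root": "N", "triad": "N", "seventh": "N", "bass": "N"}
    root, _, rest = label.partition(":")
    quality, _, bass = rest.partition("/")
    triad, seventh = QUALITY_TO_COMPONENTS[quality]
    # '1' denotes a bass on the root (no slash in the label)
    return {"root": root, "triad": triad, "seventh": seventh,
            "bass": bass or "1"}

def reassemble(parts):
    """Inverse of decompose: rebuild the full chord label from components."""
    if parts["root"] == "N":
        return "N"
    quality = COMPONENTS_TO_QUALITY[(parts["triad"], parts["seventh"])]
    bass = "" if parts["bass"] == "1" else "/" + parts["bass"]
    return parts["root"] + ":" + quality + bass
```

In the full system, a multitask classifier would predict each component independently from the audio features, and `reassemble` (or a musically informed variant of it) would combine the per-component predictions into the final chord label.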

[1] Yoshua Bengio et al., "Audio Chord Recognition with Recurrent Neural Networks," ISMIR, 2013.

[2] Gerhard Widmer et al., "Improved Chord Recognition by Combining Duration and Harmonic Language Models," ISMIR, 2018.

[3] Mark Sandler et al., "Convolutional Recurrent Neural Networks for Music Classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[4] Wei Li et al., "Music Chord Recognition Based on Midi-Trained Deep Feature and BLSTM-CRF Hybird Decoding," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[5] Tijl De Bie et al., "An End-to-End Machine Learning System for Harmonic Analysis of Music," IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[6] Simon Dixon et al., "Approximate Note Transcription for the Improved Identification of Difficult Chords," ISMIR, 2010.

[7] Geoffroy Peeters et al., "Large-Scale Study of Chord Estimation Algorithms Based on Chroma Representation and HMM," International Workshop on Content-Based Multimedia Indexing, 2007.

[8] Juan Pablo Bello et al., "Rethinking Automatic Chord Recognition with Convolutional Neural Networks," 11th International Conference on Machine Learning and Applications, 2012.

[9] Juan Pablo Bello et al., "Four Timely Insights on Automatic Chord Estimation," ISMIR, 2015.

[10] W. Bas de Haas et al., "Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations," arXiv, 2017.

[11] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," ICLR, 2015.

[12] Gerhard Widmer et al., "A Fully Convolutional Deep Auditory Model for Musical Chord Recognition," IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2016.

[13] Colin Raffel et al., "librosa: Audio and Music Signal Analysis in Python," SciPy, 2015.

[14] Gerhard Widmer et al., "Feature Learning for Chord Recognition: The Deep Chroma Extractor," ISMIR, 2016.

[15] Maximos A. Kaliakatsos-Papakostas et al., "An Idiom-independent Representation of Chords for Computational Music Analysis and Generation," ICMC, 2014.

[16] Johan Pauwels et al., "Evaluating Automatically Estimated Chord Sequences," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[17] Taemin Cho, "Improved Techniques for Automatic Chord Recognition from Music Audio Signals," 2014.

[18] Tijl De Bie et al., "Automatic Chord Estimation from Audio: A Review of the State of the Art," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.

[19] Yu-Kwong Kwok et al., "A Hybrid Gaussian-HMM-Deep Learning Approach for Automatic Chord Estimation with Very Large Vocabulary," ISMIR, 2016.

[20] Mark B. Sandler et al., "Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations," ISMIR, 2005.

[21] Juan Pablo Bello et al., "Structured Training for Large-Vocabulary Chord Recognition," ISMIR, 2017.

[22] Yu-Kwong Kwok et al., "Large Vocabulary Automatic Chord Estimation with an Even Chance Training Scheme," ISMIR, 2017.

[23] Daniel P. W. Ellis et al., "mir_eval: A Transparent Implementation of Common MIR Metrics," ISMIR, 2014.

[24] Shigeki Sagayama et al., "HMM-based Approach for Automatic Chord Detection Using Refined Acoustic Features," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.

[25] Ichiro Fujinaga et al., "An Expert Ground Truth Set for Audio Chord Recognition and Music Analysis," ISMIR, 2011.

[26] Juan Pablo Bello et al., "Learning a Robust Tonnetz-space Transform for Automatic Chord Recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[27] Heikki Huttunen et al., "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.