Semi-supervised Neural Chord Estimation Based on a Variational Autoencoder with Discrete Labels and Continuous Textures of Chords

This paper describes a statistically principled semi-supervised method of automatic chord estimation (ACE) that can make effective use of music signals whether or not chord annotations are available. The typical approach to ACE is to train a deep classification model (neural chord estimator) in a supervised manner using only a limited amount of annotated music signals. In this discriminative approach, prior knowledge about chord label sequences (characteristics of the model output) has scarcely been taken into account. In contrast, we propose a unified generative and discriminative approach in the framework of amortized variational inference. More specifically, we formulate a deep generative model that represents the complex generative process of chroma vectors (observed variables) from the discrete labels and continuous textures of chords (latent variables). Chord labels are assumed to follow a Markov model favoring self-transitions, and chord textures are assumed to follow a standard Gaussian distribution. Given chroma vectors as observed data, the posterior distributions of the latent chord labels and textures are computed approximately using a deep classification model and a deep recognition model, respectively. The generative, classification, and recognition models are combined to form a variational autoencoder and trained jointly in a semi-supervised manner. The experimental results show that the performance of the classification model can be improved by additionally using non-annotated music signals and/or by regularizing the classification model with the Markov model of chord labels and the generative model of chroma vectors, even in the fully supervised condition.
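To illustrate what a Markov prior "favoring self-transitions" means for chord label sequences, here is a minimal Python sketch. It is not the paper's implementation: the vocabulary size (25), the self-transition probability (0.9), the uniform initial distribution, the uniform off-diagonal transitions, and the function names are all assumptions made for illustration.

```python
import math


def self_transition_matrix(num_chords, self_prob):
    """Build a row-stochastic transition matrix that favors staying on the
    same chord. Off-diagonal mass is spread uniformly (an assumption)."""
    if num_chords < 2:
        raise ValueError("need at least two chord labels")
    if not 0.0 < self_prob < 1.0:
        raise ValueError("self_prob must lie strictly between 0 and 1")
    off_prob = (1.0 - self_prob) / (num_chords - 1)
    return [
        [self_prob if i == j else off_prob for j in range(num_chords)]
        for i in range(num_chords)
    ]


def log_prior(labels, trans):
    """Log-probability of a frame-level label sequence under a first-order
    Markov model with a uniform initial distribution (an assumption)."""
    log_p = -math.log(len(trans))  # uniform probability of the first label
    for prev, cur in zip(labels, labels[1:]):
        log_p += math.log(trans[prev][cur])
    return log_p


# Hypothetical settings: 25 labels, 90% chance of keeping the same chord.
trans = self_transition_matrix(num_chords=25, self_prob=0.9)

steady = [3] * 8                     # one chord held for eight frames
jittery = [3, 7, 3, 7, 3, 7, 3, 7]   # chord changes at every frame

# The prior assigns a higher log-probability to the temporally smooth sequence.
print(log_prior(steady, trans) > log_prior(jittery, trans))
```

Used as a regularizer, a prior of this kind penalizes frame-level label sequences that switch chords implausibly often, which is the role the abstract attributes to the Markov model of chord labels.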
