Learning Disentangled Representations for Timbre and Pitch in Music Audio

Timbre and pitch are the two main perceptual properties of musical sounds. Depending on the target application, we sometimes prefer to focus on one of them while reducing the effect of the other. Researchers have managed to hand-craft such timbre-invariant or pitch-invariant features using domain knowledge and signal processing techniques, but it remains difficult to disentangle the two properties in the resulting feature representations. Drawing upon state-of-the-art techniques in representation learning, we propose in this paper two deep convolutional neural network models for learning disentangled representations of musical timbre and pitch. Both models use encoders/decoders and adversarial training to learn music representations, but the second model additionally uses skip connections to deal with the pitch information. As music is an art of time, the two models are supervised by frame-level instrument and pitch labels using a new dataset collected from MuseScore. We compare the results of the two disentangling models with a new evaluation protocol called "timbre crossover", which leads to interesting applications in audio-domain music editing. Via various objective evaluations, we show that the second model can better change the instrumentation of a multi-instrument music piece without substantially affecting the pitch structure. By disentangling timbre and pitch, we envision that the model can contribute to generating more realistic music audio as well.
