Modeling the Compatibility of Stem Tracks to Generate Music Mashups

A music mashup combines audio elements from two or more songs to create a new work. To reduce the time and effort required to make them, researchers have developed algorithms that predict the compatibility of audio elements. Prior work has focused on mixing unaltered excerpts, but advances in source separation now enable mashups to be built from isolated stems (e.g., vocals, drums, and bass). In this work, we take advantage of separated stems not just for creating mashups, but for training a model that predicts the mutual compatibility of groups of excerpts, using self-supervised and semi-supervised methods. Specifically, we first build a random mashup creation pipeline that combines stem tracks obtained via source separation, automatically adjusting key and tempo to match, since these are prerequisites for high-quality mashups. To train a model to predict compatibility, we use stem tracks obtained from the same song as positive examples, and random combinations of stems with key and/or tempo left unadjusted as negative examples. To improve the model and exploit more data, we also train on "average" examples: random combinations with matching key and tempo, which we treat as unlabeled data because their true compatibility is unknown. To determine whether the combined signal or the set of separate stem signals is more indicative of the quality of the result, we experiment with two model architectures and train them using semi-supervised learning techniques. Finally, we conduct objective and subjective evaluations of the system, comparing it to a standard rule-based system.
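The example-construction scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Stem` record, the key encoding (pitch class 0–11), and the helper names are hypothetical, and real audio adjustment (pitch shifting and time stretching) is reduced here to computing the required parameters.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stem:
    """Metadata for one separated stem excerpt (hypothetical encoding)."""
    song_id: str   # identifier of the source song
    key: int       # pitch class of the tonic, 0-11
    tempo: float   # beats per minute

def training_label(a: Stem, b: Stem) -> Optional[int]:
    """Assign a label to a candidate stem combination.

    Returns 1 (positive: stems from the same song), 0 (negative: key
    and/or tempo left unadjusted), or None ("average" example with
    matching key and tempo, treated as unlabeled data by the
    semi-supervised objective).
    """
    if a.song_id == b.song_id:
        return 1
    matched = a.key == b.key and abs(a.tempo - b.tempo) < 1e-6
    if not matched:
        return 0
    return None

def adjust_params(stem: Stem, target_key: int, target_tempo: float):
    """Parameters needed to align a stem before mixing.

    Returns the semitone shift (folded into the range [-6, +6], i.e.
    the nearest direction) and the time-stretch ratio.
    """
    shift = (target_key - stem.key + 6) % 12 - 6
    stretch = target_tempo / stem.tempo
    return shift, stretch
```

For example, combining two stems from the same song yields a positive label, a raw cross-song combination yields a negative, and a cross-song combination whose key and tempo have been matched via `adjust_params` becomes an unlabeled "average" example.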
