One Billion Audio Sounds from GPU-Enabled Modular Synthesis

We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, each paired with the synthesis parameters used to generate it. The dataset is 100x larger than any audio dataset in the literature. We also introduce torchsynth, an open-source modular synthesizer that generates the synth1B1 samples on-the-fly at 16200x faster than real-time (714MHz) on a single GPU. Additionally, we release two new audio datasets: FM synth timbre and subtractive synth pitch. Using these datasets, we demonstrate new rank-based evaluation criteria for existing audio representations. Finally, we propose a novel approach to synthesizer hyperparameter optimization.
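
As a minimal sketch of how synth1B1 batches are produced with torchsynth (following the torchsynth quickstart; the default batch size of 128 and the exact return signature are assumptions that may differ across library versions):

```python
import torch
from torchsynth.synth import Voice

# Voice is torchsynth's default modular synthesizer. With default
# settings it deterministically reproduces synth1B1 in batches of
# 128 four-second sounds.
voice = Voice()

# Synthesis runs on the GPU when one is available.
if torch.cuda.is_available():
    voice = voice.to("cuda")

# Generate batch 0 of synth1B1: the audio tensors, the latent
# synthesis parameters that produced them, and a train/test flag.
audio, params, is_train = voice(0)
```

Because each batch is generated deterministically from its batch number, the full billion-sound corpus never needs to be stored on disk; any subset can be regenerated on demand.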
