RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis

Environmental sound synthesis is a technique for generating natural environmental sounds. Conventional approaches conditioned on sound event labels cannot finely control the synthesized sounds, for example, their pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis: onomatopoeic words are effective for describing the features of sounds, and we believe that using them will enable control over the fine time-frequency structure of synthesized sounds. However, no dataset is available for environmental sound synthesis using onomatopoeic words. In this paper, we therefore present RWCP-SSD-Onomatopoeia, a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis. We also collected self-reported confidence scores and acceptance scores reported by other listeners for the onomatopoeic words, to help investigate the difficulty of transcribing sounds and of selecting suitable words for environmental sound synthesis.
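To make the dataset structure concrete, the sketch below shows one way such annotations (an audio sample paired with an onomatopoeic word, a self-reported confidence score, and an acceptance score from other listeners) could be represented and loaded in Python. The field names and the CSV layout are hypothetical illustrations for this abstract, not the actual RWCP-SSD-Onomatopoeia distribution format.

```python
# Minimal sketch of a hypothetical annotation record for an
# audio/onomatopoeia dataset; not the official file format.
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class OnomatopoeiaAnnotation:
    audio_path: Path   # path to the environmental sound sample
    word: str          # onomatopoeic word transcribed by an annotator
    confidence: float  # annotator's self-reported confidence score
    acceptance: float  # acceptance score reported by other listeners


def load_annotations(csv_path: Path) -> List[OnomatopoeiaAnnotation]:
    """Load annotations from a hypothetical CSV with columns:
    audio_path, word, confidence, acceptance."""
    annotations = []
    with csv_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            annotations.append(
                OnomatopoeiaAnnotation(
                    audio_path=Path(row["audio_path"]),
                    word=row["word"],
                    confidence=float(row["confidence"]),
                    acceptance=float(row["acceptance"]),
                )
            )
    return annotations
```

Under this assumed layout, the acceptance and confidence scores can be used to filter out low-quality transcriptions before training a synthesis model.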
