Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

In this paper, we propose a framework for environmental sound synthesis from onomatopoeic words. An onomatopoeic word, a character sequence that phonetically imitates a sound, is one way of expressing an environmental sound and is effective for describing diverse sound features. Using onomatopoeic words for environmental sound synthesis therefore enables the generation of diverse environmental sounds. We propose a method based on a sequence-to-sequence framework for synthesizing environmental sounds from onomatopoeic words, as well as a method that uses onomatopoeic words together with sound event labels. Using sound event labels in addition to onomatopoeic words allows the model to capture the features of each sound event, conditioned on the input label. Our subjective experiments show that the proposed methods achieve higher diversity and naturalness than conventional methods that use sound event labels alone.
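As a rough illustration of the label-conditioning idea, the toy sketch below encodes an onomatopoeic word character by character and appends a sound-event-label embedding to every character embedding before it would be passed to a sequence-to-sequence decoder. All names, vocabularies, and dimensions here are illustrative placeholders, not the paper's actual architecture or data:

```python
import random

random.seed(0)

# Hypothetical character vocabulary and event labels (illustrative only;
# the paper's actual token set and labels may differ).
CHARS = sorted(set("abcdefghijklmnopqrstuvwxyz-"))
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}
EVENT_LABELS = ["bell", "whistle", "clap"]  # placeholder sound event labels

EMB_DIM = 8  # toy embedding size

def make_table(n, dim):
    """Random embedding table, standing in for learned embeddings."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]

char_emb = make_table(len(CHARS), EMB_DIM)
label_emb = make_table(len(EVENT_LABELS), EMB_DIM)

def encode(word, event_label):
    """Embed each character of an onomatopoeic word and concatenate the
    event-label embedding at every step, so the decoder sees which sound
    event the word describes."""
    lab = label_emb[EVENT_LABELS.index(event_label)]
    seq = []
    for c in word:
        e = char_emb[CHAR_TO_ID[c]]
        seq.append(e + lab)  # list concatenation -> 2 * EMB_DIM features
    return seq

# One encoder state per character, each carrying the label information.
states = encode("ding-dong", "bell")
```

In a full system these conditioned states would feed an attention-based decoder that emits acoustic frames (e.g., a spectrogram later inverted to a waveform); the sketch stops at the conditioning step, which is the part the abstract highlights.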
