Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

Synthesizing and converting environmental sounds have the potential for many applications such as supporting movie and game production, data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addressed environmental sound synthesis and conversion with statistical generative models; thus, this research area is not yet well organized. In this paper, we review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis using sound event labels, in which we focus on the current performance of statistical environmental sound synthesis and investigate how we should conduct subjective experiments on environmental sound synthesis.

[1]  Parag K. Mital,et al.  Time Domain Neural Audio Style Transfer , 2017, ArXiv.

[2]  Naga K. Govindaraju,et al.  Sound synthesis for impact sounds in video games , 2011, SI3D.

[3]  Chen Fang,et al.  Visual to Sound: Generating Natural Sound for Videos in the Wild , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Patrick Pérez,et al.  Audio Style Transfer , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Kunio Kashino,et al.  Generating Sound Words from Audio Signals of Acoustic Events with Sequence-to-Sequence Model , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Gaëtan Hadjeres,et al.  Deep Learning Techniques for Music Generation - A Survey , 2017, ArXiv.

[8]  Keisuke Imoto,et al.  Introduction to acoustic event and scene analysis , 2018 .

[9]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[10]  Leon A. Gatys,et al.  Image Style Transfer Using Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Seyed Hamidreza Mohammadi,et al.  An overview of voice conversion systems , 2017, Speech Commun..

[12]  Justin Salamon,et al.  Scaper: A library for soundscape synthesis and augmentation , 2017, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[13]  Yong Xu,et al.  Acoustic Scene Generation with Conditional Samplernn , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Satoshi Nakamura,et al.  Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition , 2000, LREC.

[15]  Diemo Schwarz,et al.  State of the Art in Sound Texture Synthesis , 2011 .

[16]  Yoshua Bengio,et al.  SampleRNN: An Unconditional End-to-End Neural Audio Generation Model , 2016, ICLR.

[17]  Gilberto Bernardes,et al.  Seed: Resynthesizing Environmental Sounds From Examples , 2016 .