Sound Texture Generative Model Guided by a Lossless Mel-Frequency Convolutional Neural Network

Rainstorms, insect swarms, and galloping horses produce "sound textures": natural sounds that arise from the superposition of many similar acoustic events. With new achievements in generative modeling emerging regularly, deep convolutional neural networks (CNNs) have proven tremendously successful for image and sound synthesis. Existing state-of-the-art sound texture generative models simply treat sound texture signals as 1-D images, ignoring the differences between the human visual and auditory systems. This paper considers mel-frequency statistical features, which are designed according to the human auditory system and are widely regarded as the dominant features for sound identification. We first construct a CNN structure for extracting mel-frequency features from sounds losslessly, called the mel-frequency CNN (MF-CNN). Next, we investigate a novel sound texture generative model by incorporating the MF-CNN into a convolutional generative network composed of cascading upsampling groups. A jointly alternating back-propagation algorithm is proposed to train the overall network. The feedback of the MF-CNN guides the gradients in both the inference and learning back-propagation steps, making the mel-frequency features of the synthesized sounds more similar to those of natural sounds. Moreover, the proposed generative model can also be extended to other sound synthesis tasks.
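The paper does not spell out the exact MF-CNN construction here, but the underlying mel-frequency features it targets are standard: a short-time power spectrum projected onto triangular filters spaced uniformly on the mel scale. The following is a minimal numpy sketch of that classic pipeline, assuming common default parameters (16 kHz sample rate, 512-point FFT, 40 mel bands); it illustrates the features, not the authors' network.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel scale used for MFCC-style filterbanks
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    # Triangular filters with centers spaced uniformly on the mel scale
    fmax = fmax or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power spectrum
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Project the per-frame power spectrum onto the mel filterbank
    return power @ mel_filterbank(n_mels, n_fft, sr).T

# Example: mel features of one second of white noise
rng = np.random.default_rng(0)
feats = mel_spectrogram(rng.standard_normal(16000))
print(feats.shape)  # (61, 40): 61 frames x 40 mel bands
```

Because every step here (framing, windowing, FFT, filterbank projection) is a linear or pointwise operation, it can be expressed as fixed convolutional and pooling layers, which is presumably what allows a CNN to compute such features without loss.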
