Conditioning Trick for Training Stable GANs

In this paper, we propose a conditioning trick, called difference departure from normality, applied to the generator network to counter instability during GAN training. We constrain the generator so that its departure from normality approaches that of real samples, computed in the spectral domain of the Schur decomposition. This binding makes the generator amenable to truncation and does not prevent it from exploring all possible modes. We slightly modify the BigGAN architecture by incorporating a residual network to synthesize 2D representations of audio signals, which enables reconstructing high-quality sounds with some phase information preserved. Additionally, the proposed conditional training scenario offers a trade-off between fidelity and variety for the generated spectrograms. Experimental results on the UrbanSound8K and ESC-50 environmental sound datasets and on the Mozilla Common Voice dataset show that the proposed GAN configuration with the conditioning trick remarkably outperforms baseline architectures according to three objective metrics: inception score, Fréchet inception distance, and signal-to-noise ratio.
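As a rough illustration of the quantity the abstract refers to, the sketch below computes the standard Frobenius-norm departure from normality of a square matrix from its Schur form: if A = Q T Q^H with T upper triangular, the departure is the Frobenius norm of the strictly upper-triangular part of T. The function names `departure_from_normality` and `dfn_gap` are hypothetical, the input is assumed to be a square matrix (for example a square spectrogram patch), and the gap is only one plausible reading of the "difference" term, not the authors' actual conditioning loss.

```python
import numpy as np
from scipy.linalg import schur


def departure_from_normality(a):
    """Frobenius-norm departure from normality of a square matrix.

    Assumption: `a` is square, e.g. a square spectrogram patch.
    """
    a = np.asarray(a, dtype=complex)
    # Complex Schur form: a = z @ t @ z.conj().T, with t upper triangular
    # and the eigenvalues of a on the diagonal of t.
    t, _ = schur(a, output="complex")
    # The strictly upper-triangular part of t vanishes iff a is normal.
    n = np.triu(t, k=1)
    return float(np.linalg.norm(n, "fro"))


def dfn_gap(real, fake):
    """Hypothetical helper: absolute gap between the departures of a real
    and a generated sample. Illustrative only, not the paper's loss."""
    return abs(departure_from_normality(real) - departure_from_normality(fake))


if __name__ == "__main__":
    upper = [[1.0, 2.0], [0.0, 3.0]]      # non-normal: departure is about 2
    symmetric = [[2.0, 1.0], [1.0, 2.0]]  # normal: departure is about 0
    print(departure_from_normality(upper))
    print(departure_from_normality(symmetric))
    print(dfn_gap(upper, symmetric))
```

The same value can be cross-checked without the Schur form, since the squared departure equals the squared Frobenius norm of A minus the sum of the squared eigenvalue magnitudes.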
