Guided Generative Adversarial Neural Network for Representation Learning and High Fidelity Audio Generation using Fewer Labelled Audio Data

Recent improvements in Generative Adversarial Neural Networks (GANs) have shown their ability to generate higher quality samples as well as to learn good representations for transfer learning. Most of the representation learning methods based on GANs learn representations ignoring their post-use scenario, which can lead to increased generalisation ability. However, the model can become redundant if it is intended for a specific task. For example, assume we have a vast unlabelled audio dataset, and we want to learn a representation from this dataset so that it can be used to improve the emotion recognition performance of a small labelled audio dataset. During the representation learning training, if the model does not know the post emotion recognition task, it can completely ignore emotion-related characteristics in the learnt representation. This is a fundamental challenge for any unsupervised representation learning model. In this paper, we aim to address this challenge by proposing a novel GAN framework: Guided Generative Neural Network (GGAN), which guides a GAN to focus on learning desired representations and generating superior quality samples for audio data leveraging fewer labelled samples. Experimental results show that using a very small amount of labelled data as guidance, a GGAN learns significantly better representations.

[1]  Trevor Darrell,et al.  Adversarial Feature Learning , 2016, ICLR.

[2]  Olivier Bachem,et al.  Recent Advances in Autoencoder-Based Representation Learning , 2018, ArXiv.

[3]  Pete Warden,et al.  Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition , 2018, ArXiv.

[4]  Ronald M. Summers,et al.  Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning , 2016, IEEE Transactions on Medical Imaging.

[5]  Stefano Ermon,et al.  InfoVAE: Balancing Learning and Inference in Variational Autoencoders , 2019, AAAI.

[6]  Jost Tobias Springenberg,et al.  Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks , 2015, ICLR.

[7]  Alexei A. Efros,et al.  What makes ImageNet good for transfer learning? , 2016, ArXiv.

[8]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[9]  Raja Bala,et al.  Semi-supervised Conditional GANs , 2017, ArXiv.

[10]  Andrew Y. Ng,et al.  Reading Digits in Natural Images with Unsupervised Feature Learning , 2011 .

[11]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[12]  Bernhard Schölkopf,et al.  Elements of Causal Inference: Foundations and Learning Algorithms , 2017 .

[13]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[14]  Nicki Holighaus,et al.  The Large Time-Frequency Analysis Toolbox 2.0 , 2013, CMMR.

[15]  Rajib Rana,et al.  Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion , 2019, ArXiv.

[16]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Chris Donahue,et al.  Adversarial Audio Synthesis , 2018, ICLR.

[18]  D. Dowson,et al.  The Fréchet distance between multivariate normal distributions , 1982 .

[19]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Siwei Ma,et al.  Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Navdeep Jaitly,et al.  Adversarial Autoencoders , 2015, ArXiv.

[22]  Ling Shao,et al.  Transfer Learning for Visual Categorization: A Survey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[23]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[24]  Qiang Yang,et al.  Heterogeneous Transfer Learning for Image Classification , 2011, AAAI.

[25]  Ivor W. Tsang,et al.  Learning With Augmented Features for Supervised and Semi-Supervised Heterogeneous Domain Adaptation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Kumar Krishna Agrawal,et al.  GANSynth: Adversarial Neural Audio Synthesis , 2019, ICLR.

[27]  Rishi Sharma,et al.  A Note on the Inception Score , 2018, ArXiv.

[28]  Jeff Donahue,et al.  Large Scale Adversarial Representation Learning , 2019, NeurIPS.

[29]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[30]  Xiaohua Zhai,et al.  High-Fidelity Image Generation With Fewer Labels , 2019, ICML.

[31]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Nathanael Perraudin,et al.  Adversarial Generation of Time-Frequency Features with application in audio synthesis , 2019, ICML.

[33]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Rajib Rana,et al.  Image denoising and restoration with CNN-LSTM Encoder Decoder with Direct Attention , 2018, ArXiv.

[35]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[36]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[37]  Rajib Rana,et al.  Direct Modelling of Speech Emotion from Raw Speech , 2019, INTERSPEECH.

[38]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[39]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[40]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[41]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43]  Ron J. Weiss,et al.  Unsupervised Speech Representation Learning Using WaveNet Autoencoders , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[44]  Rajib Rana,et al.  Gait Velocity Estimation Using Time-Interleaved Between Consecutive Passive IR Sensor Activations , 2016, IEEE Sensors Journal.

[45]  Aaron C. Courville,et al.  Adversarially Learned Inference , 2016, ICLR.

[46]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[47]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[48]  Bernhard Schölkopf,et al.  Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations , 2018, ICML.