Generating Images From Spoken Descriptions

Text-based technologies, such as machine translation and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form, so these languages cannot benefit from text-based technologies. This paper presents a new speech technology task, speech-to-image generation (S2IG), in which spoken descriptions are translated to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model, conditioned on the speech embeddings produced by the speech embedding network, synthesizes images that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases commonly used in text-to-image generation, CUB-200 and Oxford-102, for which we created synthesized speech descriptions, and two databases with natural speech descriptions often used in cross-modal learning of speech and images, Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality, semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
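
To make the described two-component design concrete, below is a minimal PyTorch sketch of such a pipeline: a speech embedding network that maps a spoken description to a fixed-size embedding, and a conditional generator that synthesizes an image from that embedding plus noise. The module names, layer choices, and dimensions are illustrative assumptions and do not reproduce the authors' actual S2IGAN architecture, its densely-stacked multi-stage generator, or its relation supervision.

    # Minimal sketch of a speech-to-image pipeline (assumed layer sizes and names;
    # not the authors' exact S2IGAN configuration).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEmbeddingNetwork(nn.Module):
        """Maps a mel-spectrogram to a fixed-size embedding; in the full framework
        this would be trained with supervision from paired image embeddings."""
        def __init__(self, n_mels=40, hidden=256, embed_dim=512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=5, stride=2, padding=2),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
                nn.ReLU(),
            )
            self.rnn = nn.GRU(hidden, embed_dim, batch_first=True)

        def forward(self, mel):            # mel: (B, n_mels, T)
            x = self.conv(mel)             # (B, hidden, T')
            x = x.transpose(1, 2)          # (B, T', hidden)
            _, h = self.rnn(x)             # h: (1, B, embed_dim)
            return F.normalize(h.squeeze(0), dim=-1)

    class ConditionalGenerator(nn.Module):
        """One generator stage: speech embedding + noise -> RGB image."""
        def __init__(self, embed_dim=512, noise_dim=100, img_size=64):
            super().__init__()
            self.img_size = img_size
            self.fc = nn.Linear(embed_dim + noise_dim, 128 * (img_size // 8) ** 2)
            self.up = nn.Sequential(
                nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=2), nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
            )

        def forward(self, speech_emb, noise):
            x = self.fc(torch.cat([speech_emb, noise], dim=-1))
            x = x.view(-1, 128, self.img_size // 8, self.img_size // 8)
            return self.up(x)              # (B, 3, img_size, img_size)

    # Usage: synthesize images from (dummy) spoken descriptions.
    sen = SpeechEmbeddingNetwork()
    gen = ConditionalGenerator()
    mel = torch.randn(2, 40, 200)          # batch of 2 mel-spectrograms, 200 frames
    emb = sen(mel)
    img = gen(emb, torch.randn(2, 100))
    print(img.shape)                       # torch.Size([2, 3, 64, 64])

In the actual framework, as the abstract describes, the generator is densely stacked over multiple stages and trained adversarially with additional relation supervision to keep the synthesized images semantically consistent with the spoken descriptions; the single-stage generator above only illustrates the conditioning interface.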
