End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

Orally describing what we see is a simple task that we perform in daily life. In the field of natural language processing, however, this simple task is conventionally bridged by a textual modality, which helps the system generalize across the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that requires no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of each speech caption in an unsupervised manner; the resulting discrete labels then serve as targets for an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the generated speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that our proposed text-free Image2Speech system performs close to a counterpart trained with textual information. Furthermore, our approach outperforms the most recent phoneme-based and grounding-based Image2Speech frameworks.
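The abstract describes discretizing speech captions with a VQ-VAE so that an image-captioning model can predict discrete unit labels in place of text. The paper's own implementation is not reproduced here; the snippet below is only a minimal PyTorch sketch of the VQ-VAE quantization bottleneck in van den Oord et al.'s formulation, where the codebook size, feature dimensions, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup with straight-through gradients.

    A sketch of the VQ-VAE bottleneck; sizes are hypothetical, not the
    paper's configuration.
    """

    def __init__(self, num_codes=256, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e):
        # z_e: (batch, frames, code_dim) continuous speech-encoder outputs.
        # Squared Euclidean distance from each frame to every codebook entry.
        dist = (z_e.pow(2).sum(-1, keepdim=True)
                - 2 * z_e @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        ids = dist.argmin(-1)        # discrete unit labels per frame
        z_q = self.codebook(ids)     # quantized vectors

        # Codebook loss pulls codes toward encoder outputs; the commitment
        # loss keeps the encoder close to its assigned codes.
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.beta * F.mse_loss(z_e, z_q.detach()))

        # Straight-through estimator: copy decoder gradients to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, ids, loss


if __name__ == "__main__":
    vq = VectorQuantizer()
    z_e = torch.randn(2, 100, 64)   # e.g., 100 encoded speech frames
    z_q, ids, loss = vq(z_e)
    print(ids.shape, loss.item())   # (2, 100) discrete labels, scalar loss
```

In a text-free pipeline of this kind, the per-frame `ids` sequence would presumably play the role that phoneme or word tokens play in a conventional image-captioning model, with a separate unit-to-speech inverter synthesizing the waveform from the predicted units.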
