论文信息 - End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

Describing orally what we are seeing is a simple task we do in our daily life. However, in the natural language processing field, this simple task needs to be bridged by a textual modality that helps the system to generalize various objects in the image and various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that does not need any textual information in its training. We use a vector-quantized variational autoencoder (VQ-VAE) model to learn the discrete representation of a speech caption in an unsupervised manner, where discrete labels are used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with the minimum amount of paired image-speech data while still maintaining the quality of the speech caption. Our experimental results with a multi-speaker natural speech dataset demonstrate our proposed text-free Image2Speech system’s performance close to the one with textual information. Furthermore, our approach also successfully outperforms the most recent existing frameworks with phoneme-based and grounding-based Image2Speech systems.

[1] Michael C. Frank,et al. Unsupervised word discovery from speech using automatic segmentation into syllable-like units , 2015, INTERSPEECH.

[2] Satoshi Nakamura,et al. Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario , 2016, SLTU.

[3] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Florian Metze,et al. Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Sanjeev Khudanpur,et al. Unsupervised Learning of Acoustic Sub-word Units , 2008, ACL.

[7] David A. van Leeuwen,et al. Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] James R. Glass,et al. Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.

[9] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[10] Satoshi Nakamura,et al. Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge , 2020, INTERSPEECH.

[11] Aren Jansen,et al. The zero resource speech challenge 2017 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[12] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Colin Raffel,et al. librosa: Audio and Music Signal Analysis in Python , 2015, SciPy.

[14] Sakriani Sakti,et al. The Zero Resource Speech Challenge 2019: TTS without T , 2019, INTERSPEECH.

[15] Khalil Sima'an,et al. Multi30K: Multilingual English-German Image Descriptions , 2016, VL@ACL.

[16] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17] M. Kramer. Nonlinear principal component analysis using autoassociative neural networks , 1991 .

[18] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.

[19] Samy Bengio,et al. Tacotron: Towards End-to-End Speech Synthesis , 2017, INTERSPEECH.

[20] Haizhou Li,et al. VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019 , 2019, INTERSPEECH.

[21] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[22] Sakriani Sakti,et al. The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units , 2020, INTERSPEECH.

[23] James R. Glass,et al. Towards multi-speaker unsupervised speech pattern discovery , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[25] Satoshi Nakamura,et al. Speech-to-Speech Translation Between Untranscribed Unknown Languages , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[26] Mark Hasegawa-Johnson,et al. Image 2 speech : Automatically generating audio descriptions of images , 2017 .

[27] Aren Jansen,et al. Efficient spoken term discovery using randomized algorithms , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[28] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[29] James R. Glass,et al. Deep multimodal semantic embeddings for speech and images , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[30] James R. Glass,et al. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units , 2020, ACL.

[31] Cyrus Rashtchian,et al. Collecting Image Annotations Using Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.

[32] Mark Hasegawa-Johnson,et al. Evaluating Automatically Generated Phoneme Captions for Images , 2020, INTERSPEECH.

[33] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[34] Satoshi Nakamura,et al. End-to-End Speech Recognition Sequence Training With Reinforcement Learning , 2019, IEEE Access.

[35] Sebastian Stüker,et al. Breaking the Unwritten Language Barrier: The BULB Project , 2016, SLTU.

[36] Daniel McDuff,et al. Unpaired Image-to-Speech Synthesis With Multimodal Information Bottleneck , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37] Lukás Burget,et al. Variational Inference for Acoustic Unit Discovery , 2016, Workshop on Spoken Language Technologies for Under-resourced Languages.