Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

The language acquisition literature shows that children do not build their lexicon by first segmenting the spoken input into phonemes and then assembling words from them; rather, they adopt a top-down approach, first segmenting word-like units and then breaking them down into smaller units. This suggests that the ideal way of learning a language starts from full semantic units. In this paper, we investigate whether this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluate how well such a network learns a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most effective. We also explore at which level of the network's architecture such information should be introduced so as to maximise performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, in which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.
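To make the idea concrete, the sketch below shows one plausible way (not necessarily the authors' exact implementation) to feed segment boundaries into an RNN speech encoder: a frame-level GRU processes the acoustic features, its hidden states are mean-pooled within each segment delimited by the given boundaries, and the resulting segment vectors are passed to a higher-level GRU, mirroring the hierarchical low-level-to-high-level recomposition described above. The class name, layer sizes, and mean-pooling choice are illustrative assumptions.

```python
# Hypothetical sketch of boundary-aware hierarchical encoding (assumptions:
# PyTorch, 40-dim mel features, mean-pooling over segments; not the paper's
# exact architecture).
import torch
import torch.nn as nn

class HierarchicalSpeechEncoder(nn.Module):
    def __init__(self, n_mel=40, hidden=512):
        super().__init__()
        self.frame_rnn = nn.GRU(n_mel, hidden, batch_first=True)    # low level: frames
        self.segment_rnn = nn.GRU(hidden, hidden, batch_first=True) # high level: segments

    def forward(self, feats, boundaries):
        """feats: (T, n_mel) acoustic frames of one utterance.
        boundaries: frame indices ending each segment (e.g. word ends)."""
        frame_states, _ = self.frame_rnn(feats.unsqueeze(0))  # (1, T, hidden)
        # Mean-pool frame states inside each boundary-delimited segment.
        segs, start = [], 0
        for end in boundaries:
            segs.append(frame_states[0, start:end].mean(dim=0))
            start = end
        seg_seq = torch.stack(segs).unsqueeze(0)              # (1, S, hidden)
        _, final = self.segment_rnn(seg_seq)
        return final.squeeze(0).squeeze(0)                    # utterance embedding

# Toy usage: 100 frames with segment boundaries after frames 30, 55, and 100.
enc = HierarchicalSpeechEncoder()
emb = enc(torch.randn(100, 40), boundaries=[30, 55, 100])
print(emb.shape)  # torch.Size([512])
```

In a retrieval setup, such an utterance embedding would then be matched against image embeddings, e.g. with a triplet or contrastive loss; swapping the boundary list between phone, syllable, and word boundaries is what lets one compare which segmentation level helps most.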
