Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese

We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and that this behaviour holds true across these two typologically very different languages. We also draw parallels between artificial neural attention and human attention, and show that neural attention focuses on word endings, as has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, we release to the community the bilingual (speech-image) corpora enriched with part-of-speech tags and forced alignments, to support reproducible research.
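
The abstract does not spell out the architecture, so the following is only a minimal, assumed sketch of the two ingredients it mentions: an attention-pooled recurrent speech encoder (whose per-frame attention weights can be inspected, e.g. to see whether they peak on nouns or word endings) and cross-lingual speech-to-speech retrieval by pivoting through the visual embedding space shared by two monolingual models. The class and function names (AttentiveSpeechEncoder, retrieve), the GRU encoder, the feature and embedding dimensions, and the plain cosine-similarity ranking are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed architecture, not the authors' exact code): a speech
# encoder that pools recurrent states with a self-attention layer, plus
# cross-lingual speech-to-speech retrieval through a shared visual space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveSpeechEncoder(nn.Module):
    """Encodes a sequence of acoustic features (e.g. MFCC frames) into one vector."""
    def __init__(self, feat_dim=13, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)      # scalar attention score per time step
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        states, _ = self.rnn(feats)               # (batch, time, hidden_dim)
        scores = self.attn(states).squeeze(-1)    # (batch, time)
        alpha = F.softmax(scores, dim=-1)         # attention weights over time steps
        pooled = (alpha.unsqueeze(-1) * states).sum(dim=1)
        return F.normalize(self.proj(pooled), dim=-1), alpha

def retrieve(query_embs, candidate_embs):
    """Rank candidates by cosine similarity; embeddings are L2-normalised,
    so a matrix product directly gives the similarity scores."""
    return (query_embs @ candidate_embs.t()).argsort(dim=-1, descending=True)

# Hypothetical usage: with an English and a Japanese encoder each trained to map
# speech close to embeddings of the images it describes, English-to-Japanese
# retrieval reduces to nearest-neighbour search in that common space.
en_enc, ja_enc = AttentiveSpeechEncoder(), AttentiveSpeechEncoder()
en_feats = torch.randn(4, 200, 13)               # 4 English query utterances
ja_feats = torch.randn(10, 180, 13)              # 10 Japanese candidate utterances
en_emb, en_alpha = en_enc(en_feats)              # en_alpha: where attention falls in time
ja_emb, _ = ja_enc(ja_feats)
ranking = retrieve(en_emb, ja_emb)               # (4, 10) ranked candidate indices
```

Because each monolingual encoder is trained to place spoken captions near the embeddings of the same images, the two speech embedding spaces become comparable through that visual pivot, which is what makes the final nearest-neighbour step meaningful.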
