Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech

In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description into a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology from linguistics, the gating paradigm, to analyse the representations learned by neural networks, and show that the correct representation of a word is only activated once the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded representation of a given word: some frames have a crucial effect on it. Finally, we suggest that word representations could be activated through a process of lexical competition.
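To make the gating procedure concrete, the sketch below encodes increasingly long prefixes of a word's speech frames and tracks how close each truncated encoding is to the encoding of the full word. This is a minimal toy in PyTorch, not the paper's implementation: the encoder architecture, its sizes, and the random stand-in "MFCC" input are all illustrative assumptions.

    # Minimal sketch of the gating paradigm applied to a speech encoder.
    # The encoder, its dimensions, and the random "MFCC" input are
    # illustrative assumptions, not the paper's actual model or data.
    import torch
    import torch.nn.functional as F

    class GRUEncoder(torch.nn.Module):
        """Toy recurrent encoder: mean-pools GRU states into one utterance vector."""
        def __init__(self, n_mfcc=13, hidden=256):
            super().__init__()
            self.rnn = torch.nn.GRU(n_mfcc, hidden, batch_first=True)

        def forward(self, frames):            # frames: (1, T, n_mfcc)
            states, _ = self.rnn(frames)
            return states.mean(dim=1)         # (1, hidden)

    encoder = GRUEncoder()
    word = torch.randn(1, 40, 13)             # stand-in for one word's MFCC frames

    # "Gate" the word: encode prefixes of increasing length and compare each
    # truncated representation to the representation of the full word.
    full = encoder(word)
    for t in range(5, word.size(1) + 1, 5):
        gated = encoder(word[:, :t, :])
        sim = F.cosine_similarity(gated, full).item()
        print(f"prefix of {t:2d} frames: cosine similarity to full word = {sim:.3f}")

In the paper's setting, the comparison would be against a word's reference (or visual) representation learned by the trained model rather than the output of a randomly initialised encoder, and the gates would fall at phoneme boundaries rather than at fixed frame steps.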
