Encoding of phonology in a recurrent neural model of grounded speech

We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. The model processes images and their spoken descriptions, and projects the visual and auditory representations into a shared semantic space. We perform a number of analyses of how information about individual phonemes is encoded in the MFCC features extracted from the speech signal and in the activations of the model's layers. Through experiments on phoneme decoding and phoneme discrimination, we show that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a substantial amount of phonological information is retained at the top recurrent layer. We further find that the attention mechanism following the top recurrent layer significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy. Moreover, a hierarchical clustering of the phoneme representations learned by the network reveals an organizational structure of phonemes similar to that proposed in linguistics.
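The phoneme-decoding analysis described above can be sketched as a linear probe: a classifier is trained to predict phoneme labels from per-frame feature vectors (MFCCs, or the activations of a given layer), and the probe's accuracy indicates how much phoneme information that representation exposes. The snippet below is a minimal illustration of this idea with synthetic stand-in data; the class count, dimensionality, and all variable names are assumptions, not the paper's actual code or dataset.

```python
# Hypothetical sketch of a phoneme-decoding probe. Synthetic data stands in
# for real per-frame representations (MFCCs or layer activations).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 "phoneme" classes of 13-dim MFCC-like frames,
# each class drawn around a different random mean.
n_per_class, dim = 200, 13
means = rng.normal(scale=3.0, size=(3, dim))
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_class, dim))
               for m in means])
y = np.repeat(np.arange(3), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Linear probe: higher held-out accuracy means the representation makes
# phoneme identity more linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

Comparing this accuracy across representations (input MFCCs versus each layer's activations) is what localizes phonological information in the model; by the paper's findings, accuracy would peak at the lower recurrent layers and drop after the attention layer.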
