Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech

There is growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and research on human language acquisition. Here, we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate soft text labels from the images; these serve as targets for a neural model that maps untranscribed speech to (semantic) keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a supervised model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches. We perform an extensive analysis of the model and its resulting representations.
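A minimal sketch of the setup described above: the image tagger's soft keyword probabilities act as per-word targets for the speech network, and retrieval ranks utterances by their predicted score for a query keyword. The sigmoid output layer, the cross-entropy objective against soft targets, and all function names here are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def soft_label_loss(logits, soft_targets):
    """Cross-entropy between the speech network's per-keyword outputs
    and the image tagger's soft label probabilities (one per vocab word).
    logits, soft_targets: arrays of shape (batch, vocab_size)."""
    p = sigmoid(logits)
    eps = 1e-9  # numerical safety for log
    return -np.mean(soft_targets * np.log(p + eps)
                    + (1.0 - soft_targets) * np.log(1.0 - p + eps))

def retrieve(scores, query_index, top_k=10):
    """Semantic speech retrieval: rank utterances by their predicted
    probability for a single text query keyword and return the top k.
    scores: array of shape (num_utterances, vocab_size)."""
    ranking = np.argsort(-scores[:, query_index])
    return ranking[:top_k]
```

With this framing, the reported precision at ten corresponds to the fraction of the `top_k=10` retrieved utterances that human annotators judge semantically relevant to the query.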
