Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception

Multi-modal semantics, which aims to ground semantic representations in perception, has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics. After establishing the quality of these auditorily grounded representations, we show how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments, and provide further analysis. We find that features transferred from deep neural networks outperform bag-of-audio-words approaches. To our knowledge, this is the first work to construct multi-modal models from a combination of textual information and auditory information extracted from deep neural networks, and the first work to evaluate the performance of tri-modal (textual, visual and auditory) semantic models.
