Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech
Gregory Shakhnarovich | Karen Livescu | Herman Kamper
[1] Cyrus Rashtchian,et al. Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.
[2] Stan Davis,et al. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences , 1980 .
[3] Yansong Feng,et al. Visual Information in Semantic Representation , 2010, NAACL.
[4] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[5] James R. Glass,et al. Learning Word-Like Units from Joint Audio-Visual Analysis , 2017, ACL.
[6] Adam Lopez,et al. Towards speech-to-text translation without speech recognition , 2017, EACL.
[7] Aren Jansen,et al. The Zero Resource Speech Challenge 2015: Proposed Approaches and Results , 2016, SLTU.
[8] Ellen M. Voorhees,et al. The TREC Spoken Document Retrieval Track: A Success Story , 2000, TREC.
[9] Kevin Gimpel,et al. From Paraphrase Database to Compositional Paraphrase Model and Back , 2015, TACL.
[10] James R. Glass,et al. Zero resource spoken audio corpus analysis , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[11] Cordelia Schmid,et al. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation , 2009, 2009 IEEE 12th International Conference on Computer Vision.
[12] Sanjeev Khudanpur,et al. Unsupervised Learning of Acoustic Sub-word Units , 2008, ACL.
[13] Dimitri Palaz,et al. Jointly Learning to Locate and Classify Words Using Convolutional Networks , 2016, INTERSPEECH.
[14] Emmanuel Dupoux,et al. Phonetics embedding learning with side information , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).
[15] Kenneth Ward Church,et al. A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[16] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[17] Karen Livescu,et al. Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings , 2017, INTERSPEECH.
[18] Armand Joulin,et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.
[19] Herman Kamper,et al. Unsupervised neural and Bayesian models for zero-resource speech processing , 2017, ArXiv.
[20] Xinlei Chen,et al. Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Timothy J. Hazen,et al. Query-by-example spoken term detection using phonetic posteriorgram templates , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.
[22] James R. Glass,et al. Unsupervised Lexicon Discovery from Acoustic Input , 2015, TACL.
[23] Antoni Rodríguez-Fornells,et al. Speech segmentation is facilitated by visual cues , 2010 .
[24] Eve V. Clark,et al. How language acquisition builds on cognitive development , 2004 .
[25] Florian Metze,et al. Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[26] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Michael C. Frank,et al. Using Speakers’ Referential Intentions to Model Early Cross-Situational Word Learning , 2022 .
[28] Grzegorz Chrupala,et al. Representations of language in a model of visually grounded speech signal , 2017, ACL.
[29] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.
[30] Peter Young,et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..
[31] Timothy J. Hazen,et al. Speech-based annotation and retrieval of digital photographs , 2007, INTERSPEECH.
[32] Nazli Ikizler-Cinbis,et al. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures , 2016, J. Artif. Intell. Res..
[33] James R. Glass,et al. Unsupervised Pattern Discovery in Speech , 2008, IEEE Transactions on Audio, Speech, and Language Processing.
[34] James R. Glass,et al. A Nonparametric Bayesian Approach to Acoustic Model Discovery , 2012, ACL.
[35] James R. Glass,et al. Look, listen, and decode: Multimodal speech recognition with images , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).
[36] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[37] James R. Glass,et al. Unsupervised Learning of Spoken Language with Visual Context , 2016, NIPS.
[38] Elia Bruni,et al. Multimodal Distributional Semantics , 2014, J. Artif. Intell. Res..
[39] Martha Palmer,et al. Verb Semantics and Lexical Selection , 1994, ACL.
[40] Navdeep Jaitly,et al. Sequence-to-Sequence Models Can Directly Translate Foreign Speech , 2017, INTERSPEECH.
[41] Lukás Burget,et al. Comparison of keyword spotting approaches for informal continuous speech , 2005, INTERSPEECH.
[42] James R. Glass,et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.
[43] Carina Silberer,et al. Learning Grounded Meaning Representations with Autoencoders , 2014, ACL.
[44] Timothy J. Hazen,et al. Retrieval and browsing of spoken content , 2008, IEEE Signal Processing Magazine.
[45] James R. Glass,et al. Deep multimodal semantic embeddings for speech and images , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[46] Heikki Rasilo,et al. A joint model of word segmentation and meaning acquisition through cross-situational learning. , 2015, Psychological review.
[47] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[48] Sabine Schulte im Walde,et al. A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities , 2013, EMNLP.
[49] W. Bruce Croft,et al. Query Expansion Using Local and Global Document Analysis , 1996, SIGIR Forum.
[50] Gemma Boleda,et al. Distributional Semantics in Technicolor , 2012, ACL.
[51] Grzegorz Chrupala,et al. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning , 2016, COLING.
[52] Sebastian Stüker,et al. Breaking the Unwritten Language Barrier: The BULB Project , 2016, SLTU.
[53] David Chiang,et al. An Attentional Model for Speech Translation Without Transcription , 2016, NAACL.
[54] Bert Cranen,et al. A computational model for unsupervised word discovery , 2007, INTERSPEECH.
[55] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .
[56] Guillaume Aimetti,et al. Modelling Early Language Acquisition Skills: Towards a General Statistical Learning Mechanism , 2009, EACL.
[57] Aren Jansen,et al. Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[58] Jason Weston,et al. WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.
[59] Hung-An Chang,et al. Resource configurable spoken query detection using Deep Boltzmann Machines , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[60] Alex Pentland,et al. Learning words from sights and sounds: a computational model , 2002, Cogn. Sci..
[61] Kilian Q. Weinberger,et al. Fast Image Tagging , 2013, ICML.
[62] Kevin Gimpel,et al. Towards Universal Paraphrastic Sentence Embeddings , 2015, ICLR.
[63] Utpal Garain,et al. Using Word Embeddings for Automatic Query Expansion , 2016, ArXiv.
[64] Linda B. Smith,et al. Rapid Word Learning Under Uncertainty via Cross-Situational Statistics , 2007, Psychological science.
[65] James R. Glass,et al. Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[66] Dana H. Ballard,et al. A multimodal learning interface for grounding spoken language in sensory perceptions , 2004, ACM Trans. Appl. Percept..
[67] Fei-Fei Li,et al. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[68] Chin-Hui Lee,et al. Automatic recognition of keywords in unconstrained speech using hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..
[69] James R. Glass,et al. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.
[70] Gerhard Weikum,et al. The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents , 2005, VLDB.
[71] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] James Glass,et al. Analysis of Audio-Visual Features for Unsupervised Speech Recognition , 2017 .
[73] Tanja Schultz,et al. Automatic speech recognition for under-resourced languages: A survey , 2014, Speech Commun..
[74] Lin-Shan Lee,et al. Towards unsupervised semantic retrieval of spoken content with query expansion based on automatically discovered acoustic patterns , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.
[75] Emmanuel Dupoux,et al. Learning Words from Images and Speech , 2014 .
[76] Gregory Shakhnarovich,et al. Visually Grounded Learning of Keyword Prediction from Untranscribed Speech , 2017, INTERSPEECH.
[77] Felix Hill,et al. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation , 2014, CL.
[78] Erik D. Thiessen. Effects of Visual Information on Adults' and Infants' Auditory Statistical Learning , 2010, Cogn. Sci..
[79] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[80] Yiannis Aloimonos,et al. Corpus-Guided Sentence Generation of Natural Images , 2011, EMNLP.
[81] James R. Glass,et al. Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[82] W. Bruce Croft,et al. Query expansion using local and global document analysis , 1996, SIGIR '96.
[83] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[84] Tomoaki Nakamura,et al. Symbol emergence in robotics: a survey , 2015, Adv. Robotics.
[85] Yejin Choi,et al. Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.
[86] Carina Silberer,et al. Grounded Models of Semantic Representation , 2012, EMNLP.
[87] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[88] Florian Metze,et al. Visual features for context-aware speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[89] Barbara Caputo,et al. Object Category Detection Using Audio-Visual Cues , 2008, ICVS.
[90] Jiejun Xu,et al. Multimodal photo annotation and retrieval on a mobile phone , 2008, MIR '08.
[91] Hugo Van hamme,et al. Modelling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA , 2011, Neurocomputing.
[92] Nick Craswell,et al. Query Expansion with Locally-Trained Word Embeddings , 2016, ACL.
[93] George A. Miller,et al. WordNet: A Lexical Database for English , 1995, HLT.
[94] Eneko Agirre,et al. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches , 2009, NAACL.
[95] Lin-Shan Lee,et al. Improved semantic retrieval of spoken content by language models enhanced with acoustic similarity graph , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).
[96] James R. Glass,et al. Learning modality-invariant representations for speech and images , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
[97] Herbert Gish,et al. Keyword Spotting of Arbitrary Words Using Minimal Speech Resources , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.
[98] David A. Forsyth,et al. Matching Words and Pictures , 2003, J. Mach. Learn. Res..
[99] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[100] Alexandre Bernardino,et al. Affordance based word-to-meaning association , 2009, 2009 IEEE International Conference on Robotics and Automation.
[101] J. Siskind. A computational study of cross-situational techniques for learning word-to-meaning mappings , 1996, Cognition.