[1] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Grzegorz Chrupala, et al. Representations of language in a model of visually grounded speech signal, 2017, ACL.
[3] Nazli Ikizler-Cinbis, et al. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures, 2016, J. Artif. Intell. Res.
[4] Peter Young, et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013, J. Artif. Intell. Res.
[5] Aren Jansen, et al. Efficient spoken term discovery using randomized algorithms, 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.
[6] Yin Li, et al. Learning Deep Structure-Preserving Image-Text Embeddings, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Noah A. Smith, et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, ACL 2016.
[8] Gregory Shakhnarovich, et al. Visually Grounded Learning of Keyword Prediction from Untranscribed Speech, 2017, INTERSPEECH.
[9] Nobuyuki Shimizu, et al. Cross-Lingual Image Caption Generation, 2016, ACL.
[10] Qi Wu, et al. Visual question answering: A survey of methods and datasets, 2016, Comput. Vis. Image Underst.
[11] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[12] S. Brennan, et al. Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender, 2001, Language and Speech.
[13] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[14] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.
[15] James R. Glass, et al. Learning Word-Like Units from Joint Audio-Visual Analysis, 2017, ACL.
[16] Deb Roy, et al. Grounded spoken language acquisition: experiments in word learning, 2003, IEEE Trans. Multim.
[17] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.
[18] Bogdan Ludusan, et al. Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems, 2014, LREC.
[19] James R. Glass, et al. Deep multimodal semantic embeddings for speech and images, 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[20] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.