SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set
[1] James R. Glass, et al. Deep multimodal semantic embeddings for speech and images, 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[2] D. Schwarz, et al. Corpus-Based Concatenative Synthesis, 2007, IEEE Signal Processing Magazine.
[3] Qi Wu, et al. Visual question answering: A survey of methods and datasets, 2016, Comput. Vis. Image Underst.
[4] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[5] Aren Jansen, et al. Efficient spoken term discovery using randomized algorithms, 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.
[6] Yin Li, et al. Learning Deep Structure-Preserving Image-Text Embeddings, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.
[8] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.
[9] Grzegorz Chrupala, et al. Representations of language in a model of visually grounded speech signal, 2017, ACL.
[10] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2015, CVPR.
[11] Nazli Ikizler-Cinbis, et al. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures, 2016, J. Artif. Intell. Res.
[12] Gregory Shakhnarovich, et al. Visually Grounded Learning of Keyword Prediction from Untranscribed Speech, 2017, INTERSPEECH.
[13] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[14] S. Brennan, et al. Disfluency Rates in Conversation: Effects of Age, Relationship, Topic, Role, and Gender, 2001, Language and speech.
[15] Deb Roy, et al. Grounded spoken language acquisition: experiments in word learning, 2003, IEEE Trans. Multim.
[16] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.
[17] Peter Young, et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013, J. Artif. Intell. Res.
[18] Bogdan Ludusan, et al. Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems, 2014, LREC.
[19] Nobuyuki Shimizu, et al. Cross-Lingual Image Caption Generation, 2016, ACL.
[20] James R. Glass, et al. Learning Word-Like Units from Joint Audio-Visual Analysis, 2017, ACL.