Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
Mark Hasegawa-Johnson | Najim Dehak | Xinsheng Wang | Odette Scharenborg | Liming Wang
[1] Shinji Watanabe, et al. ESPnet: End-to-End Speech Processing Toolkit, 2018, INTERSPEECH.
[2] Deb Roy, et al. A Computational Model of Word Learning from Multimodal Sensory Input, 2000.
[3] James Glass, et al. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech, 2020, ICLR.
[4] Mark Hasegawa-Johnson, et al. Multimodal Word Discovery and Retrieval With Spoken Descriptions and Visual Concepts, 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[5] James R. Glass, et al. Deep multimodal semantic embeddings for speech and images, 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
[6] Gabriel Ilharco, et al. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech, 2019, CoNLL.
[7] Thomas L. Griffiths, et al. Contextual Dependencies in Unsupervised Word Segmentation, 2006, ACL.
[8] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[9] Cyrus Rashtchian, et al. Collecting Image Annotations Using Amazon’s Mechanical Turk, 2010, Mturk@HLT-NAACL.
[10] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[11] M. Picheny, et al. Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences, 2017.
[12] Mirjam Ernestus, et al. Language learning using Speech to Image retrieval, 2019, INTERSPEECH.
[13] Mark Hasegawa-Johnson, et al. A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions, 2020, INTERSPEECH.
[14] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[15] Olivier Rosec, et al. SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set, 2017, ArXiv.
[16] Zhe Gan, et al. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[17] Armand Joulin, et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, 2014, NIPS.
[18] James R. Glass, et al. Unsupervised Learning of Spoken Language with Visual Context, 2016, NIPS.
[19] Laurent Besacier, et al. Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese, 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] James R. Glass, et al. Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[21] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[22] Gregory Shakhnarovich, et al. Semantic Speech Retrieval With a Visually Grounded Model of Untranscribed Speech, 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[23] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[24] Oriol Vinyals, et al. Neural Discrete Representation Learning, 2017, NIPS.
[25] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[26] Peter Young, et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013, J. Artif. Intell. Res.
[27] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.
[28] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[29] James R. Glass, et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input, 2018, ECCV.
[30] Stephen E. Levinson, et al. The Role of Sensorimotor Function, Associative Memory and Reinforcement Learning in Automatic Acquisition of Spoken Language by an Autonomous Robot, 1996.
[31] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[32] Michael C. Frank, et al. Unsupervised word discovery from speech using automatic segmentation into syllable-like units, 2015, INTERSPEECH.