Learning Shared Multimodal Embeddings with Unpaired Data

In this paper, we propose a method for learning a joint multimodal embedding space. Using paired text and video data, we compare the effect of various constraints on the learned space. We further propose a method to improve the joint embedding space through an adversarial formulation that exploits unpaired text and video data. In addition to evaluating on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.
