Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding

Although image and sentence matching has been widely studied, its intrinsic few-shot problem is commonly ignored, which has become a bottleneck for further performance improvement. In this work, we focus on this challenging problem of few-shot image and sentence matching, and propose a Gated Visual-Semantic Embedding (GVSE) model to deal with it. The model consists of three corporative modules in terms of uncommon VSE, common VSE, and gated metric fusion. The uncommon VSE exploits external auxiliary resources to extract generic features for representing uncommon instances and words in images and sentences, and then integrates them by modeling their semantic relation to obtain global representations for association analysis. To better model other common instances and words in rest content of images and sentences, the common VSE learns their discriminative representations directly from scratch. After obtaining two similarity metrics from the two VSE modules with different advantages, the gated metric fusion module adaptively fuses them by automatically balancing their relative importance. Based on the fused metric, we perform extensive experiments in terms of few-shot and conventional image and sentence matching, and demonstrate the effectiveness of the proposed model by achieving the state-of-the-art results on two public benchmark datasets.