Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Change captioning aims to describe the differences between two images in natural language. Most existing methods treat this problem as a pure difference judgment, assuming no distractors such as viewpoint changes are present. In practice, however, viewpoint changes occur frequently and can overwhelm the semantic differences that should be described. In this paper, we propose a novel visual encoder that explicitly distinguishes viewpoint changes from semantic changes in the change captioning task. We further simulate the attention preference of humans and propose a novel reinforcement learning process that fine-tunes the attention directly with language evaluation rewards. Extensive experiments show that our method outperforms state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
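
Fine-tuning with language evaluation rewards is commonly realized as a self-critical policy-gradient objective, where a sentence-level metric (e.g., CIDEr) scores a sampled caption and the greedily decoded caption serves as the baseline. The sketch below illustrates only that general idea, not the paper's exact formulation; the function name `self_critical_loss`, the tensor shapes, and the choice of CIDEr as the reward are assumptions made for illustration.

```python
import torch

def self_critical_loss(sample_logprobs: torch.Tensor,
                       sample_reward: torch.Tensor,
                       greedy_reward: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a self-critical baseline (illustrative sketch).

    sample_logprobs: [B, T] log-probabilities of the sampled caption tokens.
    sample_reward:   [B]    language-metric score (e.g., CIDEr) of the sampled captions.
    greedy_reward:   [B]    score of the greedily decoded captions, used as the baseline.
    mask:            [B, T] 1 for real tokens, 0 for padding.
    """
    # Advantage: how much better the sampled caption scored than the greedy baseline.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)              # [B, 1]
    # Policy-gradient loss, averaged over the non-padded tokens.
    loss = -(advantage * sample_logprobs * mask).sum() / mask.sum().clamp(min=1)
    return loss


if __name__ == "__main__":
    # Toy example with placeholder values standing in for a real decoder and scorer.
    B, T = 2, 5
    logprobs = -torch.rand(B, T)              # placeholder per-token log-probabilities
    sample_r = torch.tensor([0.8, 0.3])       # e.g., CIDEr of the sampled captions
    greedy_r = torch.tensor([0.5, 0.5])       # e.g., CIDEr of the greedy captions
    mask = torch.ones(B, T)
    print(self_critical_loss(logprobs, sample_r, greedy_r, mask))
```

In practice the rewards would come from an external caption scorer and the gradient flows only through the sampled log-probabilities; captions that beat the greedy baseline are reinforced, while worse ones are suppressed.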
