Bidirectional difference locating and semantic consistency reasoning for change captioning

Change captioning is an emerging task that aims to describe the changes between a pair of images. The core difficulty is discovering the differences between the two images. Several methods have recently been proposed for this problem, but they all rely on unidirectional difference localization to identify changes, which can leave the nature of a change ambiguous (for example, whether an object was added or removed). We instead propose a framework that combines bidirectional difference localization with semantic consistency reasoning. First, we locate the changes in the two images by capturing differences in both directions. Then we design a decoder with spatial‐channel attention to generate the change caption. Finally, we introduce semantic consistency reasoning to constrain both the bidirectional difference localization module and the spatial‐channel attention module. Extensive experiments on three public datasets show that our model outperforms state‐of‐the‐art change captioning models by a large margin.
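The two components named in the abstract can be illustrated with a minimal numerical sketch. Everything below is an assumption for illustration: the function names, the use of an L2-norm saliency, and the dot-product attention are hypothetical stand-ins for the paper's learned modules (real learned projections per direction would also break the exact forward/backward symmetry this toy version exhibits).

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_difference(feat_before, feat_after):
    """Pool difference features in both directions.

    feat_before, feat_after: (H*W, C) flattened feature maps of the
    "before" and "after" images (hypothetical shapes).
    """
    d_fwd = feat_after - feat_before   # forward: what appeared
    d_bwd = feat_before - feat_after   # backward: what disappeared
    # Per-location change saliency (L2 norm over channels), normalised
    # into a spatial attention map for each direction.
    a_fwd = softmax(np.linalg.norm(d_fwd, axis=1), axis=0)
    a_bwd = softmax(np.linalg.norm(d_bwd, axis=1), axis=0)
    # Attention-weighted pooling of the difference features.
    v_fwd = (a_fwd[:, None] * d_fwd).sum(axis=0)
    v_bwd = (a_bwd[:, None] * d_bwd).sum(axis=0)
    return v_fwd, v_bwd

def spatial_channel_attention(feats, query):
    """Toy spatial-then-channel attention for the decoder.

    feats: (H*W, C) visual features; query: (C,) decoder state
    (both hypothetical).
    """
    spatial = softmax(feats @ query, axis=0)          # where to look
    pooled = (spatial[:, None] * feats).sum(axis=0)   # (C,) context vector
    channel = softmax(pooled * query, axis=0)         # which channels matter
    return channel * pooled
```

Keeping two separate pooled vectors (rather than a single signed difference) is what lets a downstream decoder distinguish an added object from a removed one, which is the ambiguity the bidirectional design is meant to resolve.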
