Deep image captioning: A review of methods, trends and future challenges

[1]  Suhyun Cho,et al.  Generalized Image Captioning for Multilingual Support , 2023, Applied Sciences.

[2]  Taro Watanabe,et al.  Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning , 2022, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[3]  Jun Wang,et al.  An Overview of the Stability Analysis of Recurrent Neural Networks With Multiple Equilibria , 2021, IEEE Transactions on Neural Networks and Learning Systems.

[4]  Rita Cucchiara,et al.  From Show to Tell: A Survey on Deep Learning-Based Image Captioning , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  A. Mian,et al.  Language Model Agnostic Gray-Box Adversarial Attack on Image Captioning , 2023, IEEE Transactions on Information Forensics and Security.

[6]  Jun Yu,et al.  Joint Embedding of Deep Visual and Semantic Features for Medical Image Report Generation , 2023, IEEE Transactions on Multimedia.

[7]  Joseph Keshet,et al.  A Baseline for Detecting Out-of-Distribution Examples in Image Captioning , 2022, ACM Multimedia.

[8]  P. Sudeep,et al.  Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study , 2022, Circuits, Systems, and Signal Processing.

[9]  Solon Barocas,et al.  Measuring Representational Harms in Image Captioning , 2022, FAccT.

[10]  M. Ackerman,et al.  “So What? What's That to Do With Me?” Expectations of People With Visual Impairments for Image Descriptions in Their Personal Photo Activities , 2022, Conference on Designing Interactive Systems.

[11]  David Abou Chacra,et al.  The Topology and Language of Relationships in the Visual Genome Dataset , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[12]  Dan Guo,et al.  Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning , 2022, IEEE Transactions on Cybernetics.

[13]  David A. Ross,et al.  What’s in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[14]  Abdulganiyu Abdu Yusuf,et al.  An analysis of graph convolutional networks and recent datasets for visual question answering , 2022, Artificial Intelligence Review.

[15]  Noa García,et al.  Quantifying Societal Bias Amplification in Image Captioning , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Ziyu Guan,et al.  Special Issue on Decision Making in Heterogeneous Network Data Scenarios and Applications , 2022, World Wide Web.

[17]  Hongmin Cai,et al.  Learning Transferable Perturbations for Image Captioning , 2022, ACM Trans. Multim. Comput. Commun. Appl.

[18]  Yuming Fang,et al.  Revisiting image captioning via maximum discrepancy competition , 2022, Pattern Recognit.

[19]  Huifang Ma,et al.  Dual Global Enhanced Transformer for image captioning , 2022, Neural Networks.

[20]  Zhihui Li,et al.  A Comprehensive Survey of Neural Architecture Search , 2021, ACM Comput. Surv.

[21]  Mingyuan Zhou,et al.  Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning , 2021, International Journal of Computer Vision.

[22]  Mengchu Zhou,et al.  Dynamic Embedding Projection-Gated Convolutional Neural Networks for Text Classification , 2021, IEEE Transactions on Neural Networks and Learning Systems.

[23]  Qiang Wu,et al.  Dual Attention on Pyramid Feature Maps for Image Captioning , 2020, IEEE Transactions on Multimedia.

[24]  Xiaodan Liang,et al.  Unifying Relational Sentence Generation and Retrieval for Medical Image Report Composition , 2020, IEEE Transactions on Cybernetics.

[25]  Weili Guan,et al.  Chinese Image Caption Generation via Visual Attention and Topic Modeling , 2020, IEEE Transactions on Cybernetics.

[26]  Yongdong Zhang,et al.  Context-Aware Visual Policy Network for Fine-Grained Image Captioning , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Meredith Ringel Morris,et al.  Going Beyond One-Size-Fits-All Image Descriptions to Satisfy the Information Wants of People Who are Blind or Have Low Vision , 2021, ASSETS.

[28]  Zhengping Che,et al.  Hierarchical Graph Attention Network for Few-shot Visual-Semantic Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Achleshwar Luthra,et al.  MedSkip: Medical Report Generation Using Skip Connections and Integrated Attention , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[30]  Hongwei Mo,et al.  Image Caption Generation Using Multi-Level Semantic Context Information , 2021, Symmetry.

[31]  Olga Russakovsky,et al.  Understanding and Evaluating Racial Biases in Image Captioning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Amjad Rehman,et al.  Automatic medical image interpretation: State of the art and future directions , 2021, Pattern Recognit.

[33]  Lijuan Wang,et al.  VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning , 2021, AAAI.

[34]  Klaus Diepold,et al.  Multi-agent deep reinforcement learning: a survey , 2021, Artificial Intelligence Review.

[35]  Yeganeh Madadi,et al.  Adversarial Image Caption Generator Network , 2021, SN Computer Science.

[36]  Wei Liu,et al.  Human-like Controllable Image Captioning with Verb-specific Semantic Roles , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Danielle Albers Szafir,et al.  Connecting Human-Robot Interaction and Data Visualization , 2021, 2021 16th ACM/IEEE International Conference on Human-Robot Interaction (HRI).

[38]  Zhiqiang Hou,et al.  Research on Image Caption Based on Multiple Word Embedding Representations , 2021, 2021 3rd International Conference on Natural Language Processing (ICNLP).

[39]  Christopher J. Anders,et al.  Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications , 2021, Proceedings of the IEEE.

[40]  Tanaya Guha,et al.  In Defense of Scene Graphs for Image Captioning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Yong Wang,et al.  Automatic ultrasound image report generation with adaptive multimodal attention mechanism , 2021, Neurocomputing.

[42]  Vicente Ordonez,et al.  Visual News: Benchmark and Challenges in News Image Captioning , 2020, EMNLP.

[43]  Aske Plaat,et al.  A survey of deep meta-learning , 2020, Artificial Intelligence Review.

[44]  Ruixiang Tang,et al.  Mitigating Gender Bias in Captioning Systems , 2020, WWW.

[45]  Yilong Yin,et al.  Unifying Neural Learning and Symbolic Reasoning for Spinal Medical Report Generation , 2020, Medical Image Anal.

[46]  Hanwang Zhang,et al.  Deconfounded Image Captioning: A Causal Retrospect , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Aidong Zhang,et al.  A Survey on Causal Inference , 2020, ACM Trans. Knowl. Discov. Data.

[48]  Karin M. Verspoor,et al.  FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark , 2021, NeurIPS Datasets and Benchmarks.

[49]  Zhen Guo,et al.  ImageSem Group at ImageCLEFmed Caption 2021 Task: Exploring the Clinical Significance of the Textual Descriptions Derived from Medical Images , 2021, CLEF.

[50]  Nan Duan,et al.  Control Image Captioning Spatially and Temporally , 2021, ACL.

[51]  Ruqiang Yan,et al.  Domain Adversarial Graph Convolutional Network for Fault Diagnosis Under Variable Working Conditions , 2021, IEEE Transactions on Instrumentation and Measurement.

[52]  Zhenglong Sun,et al.  Intention Understanding in Human–Robot Interaction Based on Visual-NLP Semantics , 2021, Frontiers in Neurorobotics.

[53]  Md. Kishor Morol,et al.  Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit , 2020, 2020 23rd International Conference on Computer and Information Technology (ICCIT).

[54]  John D. Kelleher,et al.  Language-Driven Region Pointer Advancement for Controllable Image Captioning , 2020, COLING.

[55]  Tsung-Hui Chang,et al.  Generating Radiology Reports via Memory-driven Transformer , 2020, EMNLP.

[56]  Chengming Li,et al.  An Ensemble of Generation- and Retrieval-Based Image Captioning With Dual Generator Generative Adversarial Network , 2020, IEEE Transactions on Image Processing.

[57]  Xiaoshuai Sun,et al.  Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal , 2020, ACM Multimedia.

[58]  Zhengcong Fei,et al.  Iterative Back Modification for Faster Image Captioning , 2020, ACM Multimedia.

[59]  Jianwei Niu,et al.  Automatic Medical Image Report Generation with Multi-view and Multi-modal Attention Mechanism , 2020, ICA3PP.

[60]  Bing Liu,et al.  Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning , 2020, Knowl. Based Syst.

[61]  Ruiqin Xiong,et al.  Visual Relationship Embedding Network for Image Paragraph Generation , 2020, IEEE Transactions on Multimedia.

[62]  A. Usha Ruby,et al.  Binary cross entropy with deep learning technique for Image classification , 2020.

[63]  Anup Pillai,et al.  Chest X-ray Report Generation through Fine-Grained Label Learning , 2020, MICCAI.

[64]  Xing Xu,et al.  Fooled by Imagination: Adversarial Attack to Image Captioning Via Perturbation in Complex Domain , 2020, 2020 IEEE International Conference on Multimedia and Expo (ICME).

[65]  Yuling Xi,et al.  Stimulus-driven and concept-driven analysis for image caption generation , 2020, Neurocomputing.

[66]  Xu Zhou,et al.  Improving Image Captioning with Better Use of Caption , 2020, ACL.

[67]  Jingsong He,et al.  Boosting image caption generation with feature fusion module , 2020, Multimedia Tools and Applications.

[68]  Li Wen,et al.  Deep learning for ultrasound image caption generation based on object detection , 2020, Neurocomputing.

[69]  Zhe Gan,et al.  Improving Adversarial Text Generation by Modeling the Distant Future , 2020, ACL.

[70]  Jing Liu,et al.  Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning , 2020, IJCAI.

[71]  Jianfeng Gao,et al.  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.

[72]  Hongyuan Zha,et al.  Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption , 2020, AAAI.

[73]  Bodo Rosenhahn,et al.  Image Captioning through Image Transformer , 2020, ACCV.

[74]  Tao Mei,et al.  X-Linear Attention Networks for Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Jing Liu,et al.  Normalized and Geometry-Aware Self-Attention Network for Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Yiran Chen,et al.  A Survey of Accelerator Architectures for Deep Neural Networks , 2020.

[77]  Weili Guan,et al.  Image caption generation with dual attention mechanism , 2020, Inf. Process. Manag.

[78]  Ajay Bansal,et al.  Ensemble Learning on Deep Neural Networks for Image Caption Generation , 2020, 2020 IEEE 14th International Conference on Semantic Computing (ICSC).

[79]  Junbo Wang,et al.  Learning visual relationship and context-aware attention for image captioning , 2020, Pattern Recognit.

[80]  Xinlei Chen,et al.  In Defense of Grid Features for Visual Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  Yue Zhang,et al.  An Overview of Image Caption Generation Methods , 2020, Comput. Intell. Neurosci.

[82]  Marcella Cornia,et al.  Meshed-Memory Transformer for Image Captioning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Heng Tao Shen,et al.  Hierarchical LSTMs with Adaptive Attention for Visual Captioning , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[84]  Xian Wu,et al.  Prophet Attention: Predicting Attention with Future Attention , 2020, NeurIPS.

[85]  Steven Horng,et al.  MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports , 2019, Scientific Data.

[86]  Jaewoo Kang,et al.  Graph Transformer Networks , 2019, NeurIPS.

[87]  Xiaojun Wan,et al.  Generating Diverse and Descriptive Image Captions Using Visual Paraphrases , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[88]  Liang Sun,et al.  Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[89]  Tao Mei,et al.  Hierarchy Parsing for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[90]  I. Kweon,et al.  Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach , 2019, EMNLP.

[91]  Zhe Gan,et al.  TIGEr: Text-to-Image Grounding for Image Caption Evaluation , 2019, EMNLP.

[92]  Lin Li,et al.  Squeeze-and-Excitation Wide Residual Networks in Image Classification , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[93]  Fawaz Sammani,et al.  Look and Modify: Modification Networks for Image Captioning , 2019, BMVC.

[94]  Yu-Wing Tai,et al.  Reflective Decoding Network for Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[95]  Hanqing Lu,et al.  Aligning Linguistic Words and Visual Semantic Units for Image Captioning , 2019, ACM Multimedia.

[96]  Fenglin Liu,et al.  Exploring and Distilling Cross-Modal Information for Image Captioning , 2019, IJCAI.

[97]  Jiebo Luo,et al.  Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment , 2019, MICCAI.

[98]  Heng Tao Shen,et al.  Deliberate Attention Networks for Image Captioning , 2019, AAAI.

[99]  Lin Wu,et al.  CORAL8: Concurrent Object Regression for Area Localization in Medical Image Panels , 2019, MICCAI.

[100]  Simao Herdade,et al.  Image Captioning: Transforming Objects into Words , 2019, NeurIPS.

[101]  Hanqing Lu,et al.  MSCap: Multi-Style Image Captioning With Unpaired Stylized Text , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[102]  Yuan Yan Tang,et al.  Maximum Likelihood Estimation-Based Joint Sparse Representation for the Classification of Hyperspectral Remote Sensing Images , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[103]  Baoyuan Wu,et al.  Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[104]  Tao Mei,et al.  Pointing Novel Objects in Image Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[105]  Gang Wang,et al.  Unpaired Image Captioning via Scene Graph Alignments , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[106]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[107]  Kang Li,et al.  Visual to Text: Survey of Image and Video Captioning , 2019, IEEE Transactions on Emerging Topics in Computational Intelligence.

[108]  Shahram Latifi,et al.  Audio Enhancement and Synthesis using Generative Adversarial Networks: A Survey , 2019, International Journal of Computer Applications.

[109]  Sanja Fidler,et al.  Learning to Caption Images Through a Lifetime by Asking Questions , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[110]  Md. Zakir Hossain,et al.  A Comprehensive Survey of Deep Learning for Image Captioning , 2018, ACM Comput. Surv.

[111]  Paul Babyn,et al.  Generative Adversarial Network in Medical Imaging: A Review , 2018, Medical Image Anal.

[112]  Sungroh Yoon,et al.  How Generative Adversarial Networks and Their Variants Work , 2017, ACM Comput. Surv.

[113]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[114]  Sam Kwong,et al.  Deep sequential fusion LSTM network for image description , 2018, Neurocomputing.

[115]  Shuang Bai,et al.  A survey on automatic image caption generation , 2018, Neurocomputing.

[116]  Tao Xu,et al.  Multimodal Recurrent Model with Attention for Automated Radiology Report Generation , 2018, MICCAI.

[117]  Wei Liu,et al.  Recurrent Fusion Network for Image Captioning , 2018, ECCV.

[118]  Bo Dai,et al.  Rethinking the Form of Latent States in Image Captioning , 2018, ECCV.

[119]  Qingyang Xu,et al.  A survey on deep neural network-based image captioning , 2018, The Visual Computer.

[120]  Xuanjing Huang,et al.  Toward Diverse Text Generation with Inverse Reinforcement Learning , 2018, IJCAI.

[121]  Lei Zhang,et al.  Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning , 2018, ArXiv.

[122]  Trevor Darrell,et al.  Women also Snowboard: Overcoming Bias in Captioning Models , 2018, ECCV.

[123]  Yongdong Zhang,et al.  GLA: Global–Local Attention for Image Description , 2018, IEEE Transactions on Multimedia.

[124]  Li Zhang,et al.  A region-based image caption generator with refined descriptions , 2018, Neurocomputing.

[125]  Jinfeng Yi,et al.  Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning , 2017, ACL.

[126]  Xirong Li,et al.  Predicting Visual Features From Text for Image and Video Caption Retrieval , 2017, IEEE Transactions on Multimedia.

[127]  Gang Wang,et al.  Stack-Captioning: Coarse-to-Fine Learning for Image Captioning , 2017, AAAI.

[128]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[129]  Xu Sun,et al.  Diversity-Promoting GAN: A Cross-Entropy Based Generative Adversarial Network for Diversified Text Generation , 2018, EMNLP.

[130]  Bo Zhao,et al.  AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding , 2017, ArXiv.

[131]  Rita Cucchiara,et al.  Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention , 2017.

[132]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[133]  Kevin Lin,et al.  Adversarial Ranking for Language Generation , 2017, NIPS.

[134]  Min Sun,et al.  Show, Adapt and Tell: Adversarial Training of Cross-Domain Image Captioner , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[135]  Garrison W. Cottrell,et al.  Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[136]  Ping Tan,et al.  DualGAN: Unsupervised Dual Learning for Image-to-Image Translation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[137]  Bernt Schiele,et al.  Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[138]  Gang Wang,et al.  An Empirical Study of Language CNN for Image Captioning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[139]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[140]  Cordelia Schmid,et al.  Areas of Attention for Image Captioning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[141]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, International Journal of Computer Vision.

[142]  Siqi Liu,et al.  Improved Image Captioning via Policy Gradient optimization of SPIDEr , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[143]  Zhe Gan,et al.  Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[144]  Tao Mei,et al.  Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[145]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[146]  Xirong Li,et al.  Adding Chinese Captions to Images , 2016, ICMR.

[147]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[148]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[149]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[150]  Clement J. McDonald,et al.  Preparing a collection of radiology examinations for distribution and retrieval , 2015, J. Am. Medical Informatics Assoc.

[151]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[152]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[153]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[154]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[155]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[156]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res.

[157]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.