A thorough review of models, evaluation metrics, and datasets on image captioning
