A Short Review on Image Caption Generation with Deep Learning

Deep learning methods offer great potential for applications that automatically generate captions or descriptions for images. Image captioning is considered one of the intellectually challenging problems in imaging science. Application domains include automatic caption (or description) generation for people with varying degrees of visual impairment; automatic creation of image metadata (indexing) for use by search engines; general-purpose robot vision systems; and many others. Each of these application domains can, in turn, significantly benefit many other task-specific applications. This paper is not meant to be a comprehensive review of image captioning; rather, it is a concise review of deep-learning-based image captioning methodologies, their strengths and limitations, and the datasets and evaluation metrics used in automatic image captioning. Finally, the software and hardware requirements for implementing an image captioning method are briefly discussed.
