Generating image descriptions using capsule network

Abstract Using a capsule network to generate natural language descriptions of images with different orientations, shapes, sizes and regions is a novel approach to the image-to-text problem. In this paper we present a model that generates image descriptions using a convolutional neural network (CNN) and a capsule network. First, the model is trained on the MNIST dataset to test how accurately the capsule network identifies images with different orientations; we then use the Flickr8K [3] and Flickr30K [4, 5] datasets with a CNN and a bidirectional recurrent neural network to generate text descriptions. We add a step that rotates these images by certain angles, which creates a new set of training samples for images with different orientations. Second, we analyse the performance of the two systems separately and combined, and calculate the aggregated result while taking into account the complexity of the image descriptor and the pose content of the images. We find that this approach performs significantly better at describing images with different orientations and scales to new images outside the training samples without a predefined set of guidelines.
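The rotation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which angles are used or how captions are paired with rotated images, so the choice of 90, 180 and 270 degrees, the reuse of the original caption, and the helper names `rotate` and `augment_with_rotations` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # a grayscale image stored as rows of pixel values


def rotate90_clockwise(image: Image) -> Image:
    """Rotate a 2-D image by 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise turn.
    return [list(row) for row in zip(*image[::-1])]


def rotate(image: Image, angle: int) -> Image:
    """Rotate clockwise by a multiple of 90 degrees (assumed angle set)."""
    if angle % 90 != 0:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    rotated = [list(row) for row in image]  # copy so the input is untouched
    for _ in range((angle // 90) % 4):
        rotated = rotate90_clockwise(rotated)
    return rotated


def augment_with_rotations(
    samples: Sequence[Tuple[Image, str]],
    angles: Sequence[int] = (90, 180, 270),  # hypothetical choice of angles
) -> List[Tuple[Image, str]]:
    """Return the original (image, caption) pairs plus rotated copies."""
    augmented = []
    for image, caption in samples:
        augmented.append((image, caption))
        for angle in angles:
            # Assumption: the caption is reused unchanged for each rotation.
            augmented.append((rotate(image, angle), caption))
    return augmented
```

With three angles, each original sample yields four training samples. Arbitrary angles would need interpolation and padding, which an image library would normally handle.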

[1] Ruslan Salakhutdinov, et al. Multimodal Neural Language Models, 2014, ICML.

[2] Yejin Choi, et al. Composing Simple Image Descriptions using Web-scale N-grams, 2011, CoNLL.

[3] Vicente Ordonez, et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.

[4] Luc Van Gool, et al. The Pascal Visual Object Classes (VOC) Challenge, 2010, International Journal of Computer Vision.

[5] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Armand Joulin, et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, 2014, NIPS.

[7] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Amit Prakash Singh, et al. Dynamic Routing Using Inter Capsule Routing Protocol between Capsules, 2018, 2018 UKSim-AMSS 20th International Conference on Computer Modelling and Simulation (UKSim).

[9] Peter Young, et al. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, 2013, J. Artif. Intell. Res.

[10] Yong-Sheng Chen, et al. Batch-normalized Maxout Network in Network, 2015, ArXiv.

[11] Geoffrey E. Hinton, et al. Transforming Auto-Encoders, 2011, ICANN.

[12] Beat Fasel, et al. Rotation-Invariant Neoperceptron, 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[13] Andrea Vedaldi, et al. Learning Covariant Feature Detectors, 2016, ECCV Workshops.

[14] David Klinghoffer, et al. Baby talk, 1995, First things.