A Deep Learning Approach for Arabic Caption Generation Using Roots-Words

Automatic caption generation is a key research field in the machine learning community. However, most current research targets English caption generation, ignoring other languages such as Arabic and Persian. In this paper, we propose a novel technique that leverages the heavy influence of root words in Arabic to automatically generate Arabic captions. Fragments of the images are associated with root words, and a deep belief network pre-trained using Restricted Boltzmann Machines is used to extract the words associated with an image. Finally, dependency tree relations over the root words are used to generate sentence captions. Our approach is robust and attains a BLEU-1 score of 34.8.

With the increase in the number of devices with cameras, there is widespread interest in generating automatic captions from images and videos. Generating image descriptions has a large impact on information retrieval, accessibility for the vision impaired, and categorization of images. Additionally, automatic image descriptions can be generated frame by frame to describe videos and explain their context. Automatic generation of image descriptions is a widely researched problem. However, most visual recognition models and approaches in this field focus on Western languages, ignoring Semitic and Middle-Eastern languages such as Arabic, Hebrew, Urdu, and Persian. As discussed in the related work, almost all major caption generation models have validated their approaches on English. This is primarily due to the significant differences between dialects of Arabic and the challenges of translating images into natural-sounding sentences. Arabic ranks fifth among languages by number of native speakers. Furthermore, Arabic has a tremendous social and political impact in today's world and is listed as one of the six official languages of the United Nations.
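Since results are reported as a BLEU-1 score, a minimal sketch of that metric may be useful: BLEU-1 is clipped unigram precision multiplied by a brevity penalty. This helper is illustrative only, not the paper's evaluation code.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """BLEU-1: clipped unigram precision times a brevity penalty.

    candidate: list of tokens; references: list of token lists.
    """
    cand_counts = Counter(candidate)
    # Clip each unigram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty uses the reference length closest to the candidate.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision
```

For example, a candidate identical to a reference scores 1.0, while a two-word candidate sharing one word with a two-word reference scores 0.5.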
Given the high influence of Arabic, a robust approach for generating Arabic image captions is necessary. In this paper, we propose a three-stage root-word-based approach for generating Arabic captions for images. Briefly, we first create fragments of images using a deep neural network previously trained on ImageNet. However, unlike other published approaches for English caption generation (Socher et al. 2014; Karpathy and Fei-Fei 2015; Vinyals et al. 2015), we map these fragments to a set of Arabic root words rather than to actual English words or sentences. We use deep belief networks pre-trained with Restricted Boltzmann Machines to select the root words associated with the image fragments and to extract the most appropriate words for the image. A rank-based approach selects the best image-sentence pairing from among false image-sentence pairings. Finally, we use dependency tree relations to create sentence captions from the obtained words. Our main contribution in this paper is three-fold:

• Mapping image fragments onto Arabic root words rather than onto actual sentences or words/fragments of sentences, as suggested in previously proposed approaches.

• Finding the most appropriate words for an image by using deep learning to choose the set of vowels to be added to the root words.

• Using dependency tree relations over these obtained words to form sentences in Arabic.

To the best of our knowledge, this is the first work that leverages root-word dependency relations to generate captions in Arabic. We rely on previously published approaches, specifically ImageNet and Caffe, for object detection and feature extraction from images. For clarity, we use the term root-words throughout this paper to denote the roots of Arabic words.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
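The three-stage pipeline above can be sketched with a toy, self-contained example. The tiny lexicon, the fixed verb, and the stand-ins for the CNN, the deep belief network, and the dependency model below are all illustrative assumptions, not the paper's actual models or data.

```python
# Stage 1 stand-in: pretend a CNN pre-trained on ImageNet returned
# labelled image fragments for a photo of a boy playing with a ball.
fragments = ["boy", "ball"]

# Stage 2a stand-in: map each fragment to an Arabic (consonantal) root.
fragment_to_root = {"boy": "w-l-d", "ball": "k-r-w"}

# Stage 2b stand-in: the DBN's job of choosing the vowel pattern that
# turns a root into a surface word, here reduced to a lookup table.
root_to_word = {"w-l-d": "walad", "k-r-w": "kura"}

def caption(fragments):
    roots = [fragment_to_root[f] for f in fragments]
    words = [root_to_word[r] for r in roots]
    # Stage 3 stand-in: dependency-style ordering with a fixed verb,
    # using verb-subject-object order (common in Modern Standard Arabic).
    return " ".join(["yal'abu"] + words)  # "(he) plays boy ball"

print(caption(fragments))  # yal'abu walad kura
```

In the actual system, each lookup table is replaced by a learned model: fragment features score candidate roots, the RBM-pre-trained network selects the vowel pattern, and dependency tree relations order the words.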

[1] A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR, 2015.

[2] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. CVPR, 2015.

[3] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded Compositional Semantics for Finding and Describing Images with Sentences. TACL, 2014.