Vocabulary Learning Support System Based on Automatic Image Captioning Technology

Learning context has proven to be an essential part of vocabulary development; however, describing a learning context for each vocabulary item is difficult. For the human brain, it is relatively easy to grasp a learning context from a picture, because a picture conveys at a glance an immense amount of detail that text annotations cannot. Therefore, in an informal language learning system, pictures can be used to overcome the problems that language learners face in describing learning contexts. The present study aimed to develop a support system that generates and represents learning contexts automatically by analyzing the visual content of pictures captured by language learners. Automatic image captioning, an artificial intelligence technology that connects computer vision and natural language processing, is used to analyze the visual content of the learners' captured images. A neural image caption generator model called Show and Tell is trained for image-to-word generation and to describe the context of an image. The objectives of this research are threefold: first, to provide an intelligent technology that understands the content of a picture and generates learning contexts automatically; second, to let a learner study multiple vocabulary items from a single picture without relying on a representative picture for each item; and third, to map a learner's prior vocabulary knowledge onto new vocabulary so that previously acquired words are reviewed and recalled while new words are learned.
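
The overall pipeline can be illustrated with a minimal Python sketch. This is not the authors' implementation; generate_caption stands in for the trained Show-and-Tell captioner, and the stop-word list and the learner's known-word set are hypothetical placeholders. It only shows how a generated caption could be turned into "new" versus "review" vocabulary for a learner.

```python
import re

# Hypothetical minimal stop-word list; a real system would use a proper one.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is", "are", "with", "and"}


def generate_caption(image_path: str) -> str:
    """Placeholder for the trained Show-and-Tell style caption generator.
    In the actual system this would run the neural captioner on the
    learner's captured photo and return a natural-language description."""
    return "a group of people sitting around a wooden table with laptops"


def extract_vocabulary(caption: str) -> set[str]:
    """Keep the content words of the caption as candidate vocabulary items."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def split_by_prior_knowledge(candidates: set[str], known: set[str]):
    """Map caption words onto the learner's prior vocabulary: words the
    learner already knows become review items, the rest are new items."""
    return candidates & known, candidates - known


if __name__ == "__main__":
    caption = generate_caption("learner_photo.jpg")  # hypothetical image path
    words = extract_vocabulary(caption)
    review, new = split_by_prior_knowledge(words, known={"people", "table"})
    print("caption:", caption)
    print("review:", sorted(review))
    print("new:", sorted(new))
```

Under these assumptions, one picture yields several vocabulary items at once, and the intersection with the learner's known-word set is what allows previously acquired words to be recalled alongside the new ones.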
