Automatic Caption Generation for Video Clips with Frame Reduction to Shorten Processing Time

Conventional methods generate text descriptions from all frames sampled at regular intervals from a video clip [1]. Most description generation methods rely on deep learning techniques such as the encoder-decoder framework [2] [3]. Consequently, training and text generation take a long time when a video has many frames; in other words, processing time increases with the length of the video clip. Prior research has reduced the number of processed frames by using keyframes of the video, but this approach does not account for the temporal variation within video clips. A video is a set of consecutive images, yet it clearly differs from a simple image because it carries temporal information. We assume that shortening processing time requires not only reducing the number of processed frames but also considering the time steps within a video clip. Humans can produce descriptive text from only some of the frames rather than all of them; in particular, we can recognize the content from just a few frames when the video is short, such as clips uploaded to Vine or Twitter. For these reasons, we propose a new method that reduces the number of processed frames while preserving semantics, and we compare the proposed method with conventional ones in terms of caption score and processing time through an evaluation experiment.
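To make the conventional baseline concrete, the sketch below shows uniform-interval frame sampling, the approach described in [1]. The function name and interval value are our own illustrative choices, not taken from the cited work; the point is simply that the number of frames fed to the encoder-decoder shrinks by the sampling factor, which is what drives processing time.

```python
def uniform_frame_indices(num_frames: int, interval: int) -> list[int]:
    """Return indices of frames sampled at a fixed interval.

    This is the conventional baseline: every `interval`-th frame is kept,
    regardless of the video's content or temporal structure.
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return list(range(0, num_frames, interval))


# Example: a 10-second clip at 30 fps has 300 frames.
# Sampling every 10th frame reduces the encoder's input from 300 to 30,
# but the choice of interval ignores which frames carry the semantics.
indices = uniform_frame_indices(300, 10)
print(len(indices))      # 30
print(indices[:3])       # [0, 10, 20]
```

A keyframe-based method would instead pick indices by content change, and our proposed method additionally weighs the time-step structure; both start from the same idea of passing a reduced index set to the captioning model.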