Video Analysis and Natural Language Description Generation System

This project centers on scene understanding from video input, removing the need to monitor the feed manually and continuously. Raw video frames are extracted from the input, and a combination of 2D and 3D CNNs extracts a feature vector from them. The You Only Look Once, version 3 (YOLOv3) algorithm identifies the objects present in each frame, and a count of each object type is stored. The pose of the people in the frames is estimated to identify their movements, and from these movements the actions being performed are recognized. The words produced by these three methods (object labels, object counts, and recognized actions) form the input to an LSTM cell, which selects words according to their probabilities and confidence scores and composes a natural-language sentence for the user. Finally, the generated output can be edited or replaced entirely by the user through a human-in-the-loop mechanism, if required; the model retrains itself on this feedback and produces better results the next time. The central model is capable of both identifying and discriminating between the types of elements required for this project. The project was built as a continuation of a previous system that performs object identification on live video streamed from drones. Under poor network conditions, when transmitting video data becomes difficult, the data is sent in textual form instead.
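As a hedged illustration of the per-frame detection step, the sketch below runs YOLOv3 through OpenCV's DNN module and tallies detected objects by class. The file names (yolov3.cfg, yolov3.weights, coco.names, input.mp4), the 0.5 confidence threshold, and the 416x416 input size are assumptions for the sketch, not values taken from this work; a real pipeline would also apply non-maximum suppression before counting so overlapping boxes are not double-counted.

```python
# Minimal per-frame YOLOv3 detection-and-count sketch (hypothetical file paths).
import collections

import cv2
import numpy as np

CONF_THRESHOLD = 0.5  # assumed threshold, not from the paper


def detect_and_count(frame, net, class_names):
    """Run YOLOv3 on one frame and return per-class object counts."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    counts = collections.Counter()
    for output in outputs:        # one output array per YOLO detection scale
        for detection in output:  # row layout: [cx, cy, w, h, obj, class scores...]
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            if scores[class_id] > CONF_THRESHOLD:
                # NOTE: no non-maximum suppression here, so overlapping
                # detections of the same object may be counted twice.
                counts[class_names[class_id]] += 1
    return counts


net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
with open("coco.names") as f:
    class_names = [line.strip() for line in f]

cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
if ok:
    print(detect_and_count(frame, net, class_names))
cap.release()
```

The per-frame counts produced this way are what the text refers to as the stored object counts; together with the recognized actions they supply the candidate words for sentence generation.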
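The sentence-generation step could be sketched as follows: an LSTM cell seeded with the fused feature vector emits one word per step, choosing the most probable vocabulary entry until an end-of-sentence token appears. This is a minimal PyTorch sketch under stated assumptions; the dimensions, the special-token ids, and the greedy selection rule are illustrative, since the description only states that words are chosen by probability and confidence.

```python
# Minimal caption-decoder sketch. FEAT_DIM/HIDDEN/VOCAB and the bos/eos
# token ids are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, VOCAB = 512, 256, 1000


class CaptionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.init_h = nn.Linear(FEAT_DIM, HIDDEN)  # feature vector -> initial hidden state
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTMCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)        # hidden state -> word scores

    def forward(self, features, max_len=20, bos=1, eos=2):
        h = torch.tanh(self.init_h(features))
        c = torch.zeros_like(h)
        word = torch.full((features.size(0),), bos, dtype=torch.long)
        sentence = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            probs = torch.softmax(self.out(h), dim=-1)
            word = probs.argmax(dim=-1)            # greedily pick the most probable word
            sentence.append(word)
            if (word == eos).all():                # stop once every sequence has ended
                break
        return torch.stack(sentence, dim=1)


decoder = CaptionDecoder()
feat = torch.randn(1, FEAT_DIM)  # stand-in for the fused CNN/detection/action features
print(decoder(feat))             # token ids; a vocabulary maps them back to words
```

Greedy argmax decoding is only one plausible reading of "selecting words by probability and confidence"; beam search or sampling would fit the same description.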
