Attention Based Multi-Modal Fusion Architecture for Open-Ended Video Question Answering Systems
Abstract Open-ended video question answering is a very challenging problem with widespread real-life applications. Existing systems tend to focus on single-word answers and cannot be easily extended to generate open-ended responses. In this paper, we propose adapting an architecture popularly used for video captioning to the problem of open-ended video-based question answering. To generate good answers, the model must attend to each frame individually and also understand how to link information across frames. It must likewise account for the different modalities and adapt accordingly while processing both the videos and the questions. We propose an attention-based multimodal fusion architecture for Video Question Answering (AMF-VQA) that applies an attention mechanism at every time step when outputting a word. This mechanism allows the model to focus on different frames, as well as on different modalities, for every single word it generates. The proposed model is flexible: additional modalities such as audio features or captions can be added to the existing model, which can then be fine-tuned to improve results when these features are available.
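The per-step fusion described above (attend over frames, attend over question tokens, then attend over the modality summaries) can be sketched in NumPy. This is a minimal illustrative sketch under assumed shapes, not the paper's actual implementation; the function names (`attend`, `fuse_step`) and the single shared feature dimension are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, items):
    """Scaled dot-product attention: score each row of `items`
    against `query`, return the weighted sum and the weights."""
    scores = items @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ items, weights

def fuse_step(decoder_state, frame_feats, question_feats):
    """One decoding time step of attention-based multimodal fusion
    (hypothetical sketch): attend over video frames and question
    tokens separately, then attend over the two modality summaries
    so the model can reweight modalities per output word."""
    video_ctx, _ = attend(decoder_state, frame_feats)      # (d,)
    text_ctx, _ = attend(decoder_state, question_feats)    # (d,)
    modalities = np.stack([video_ctx, text_ctx])           # (2, d)
    fused, modality_weights = attend(decoder_state, modalities)
    return fused, modality_weights
```

Because the modality-level attention weights are recomputed from the decoder state at every step, the model can lean on the video frames for some words and on the question for others; adding a new modality (e.g. audio) only means stacking one more summary vector before the final attention.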