Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering

Video question answering has recently received considerable attention from multimodal video researchers. Most video question answering datasets take the form of multiple-choice questions. However, a model trained for the multiple-choice task does not actually generate an answer; it merely compares the given answer candidates and picks the most likely one, which also makes it difficult to extend to other tasks. In this paper, we move beyond multiple-choice video question answering by reformulating it as open-ended video question answering. To tackle open-ended question answering, we use the pretrained GPT-2 model, fine-tuned on video inputs and subtitles. An ablation study performed on the DramaQA dataset, converted to an open-ended question answering format, shows that performance can be improved by using video metadata.
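The sketch below illustrates one plausible way to feed video features and subtitle text into a pretrained GPT-2 for open-ended answer generation. It is a minimal, hypothetical example, not the authors' implementation: the feature dimensions, the linear projection used to map visual features into the GPT-2 embedding space, and the prompt format are all assumptions, and the Hugging Face `transformers` API is used in place of whatever framework the paper employs. Generating from `inputs_embeds` with a decoder-only model requires a reasonably recent `transformers` release.

```python
# Minimal sketch (not the authors' code) of fusing video features and subtitles
# with a pretrained GPT-2 for open-ended video question answering.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Assumed per-clip visual features, e.g. from a pretrained CNN: (num_clips, 2048).
video_features = torch.randn(8, 2048)
project = nn.Linear(2048, model.config.n_embd)        # map features to GPT-2 hidden size
video_embeds = project(video_features).unsqueeze(0)   # (1, num_clips, 768)

# Subtitle + question as a plain text prompt (format is illustrative only).
text = "Subtitle: I can't believe he said that. Question: Why is she upset? Answer:"
text_ids = tokenizer(text, return_tensors="pt").input_ids
text_embeds = model.transformer.wte(text_ids)          # (1, seq_len, 768)

# Prepend the projected video embeddings to the token embeddings and generate.
inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
generated = model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    max_new_tokens=30,
    do_sample=True,        # nucleus sampling is a common choice for open-ended generation
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# With inputs_embeds only, the returned ids contain just the newly generated answer tokens.
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In practice the projection layer (and optionally the language model) would be fine-tuned on question-answer pairs rather than used zero-shot as shown here.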
