SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events

Traffic event cognition and reasoning in videos is an important task with a wide range of applications in intelligent transportation, assisted driving, and autonomous vehicles. In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA and comprises 10,080 in-the-wild videos with 62,535 annotated QA pairs, for benchmarking the causal-inference and event-understanding capability of models in complex traffic scenarios. Specifically, we propose six challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate reasoning over different kinds of complex yet practical traffic events. Moreover, we propose Eclipse, a novel efficient glimpse network that performs dynamic inference to achieve computation-efficient and reliable video reasoning. Experiments show that our method achieves superior performance while significantly reducing the computation cost. Project page: https://github.com/SUTDCV/SUTD-TrafficQA.
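To make the idea of glimpse-based dynamic inference concrete, the sketch below illustrates one plausible reading of it: a cheap per-frame scorer ranks frames, the costly backbone is applied only to the frames actually selected, and inference stops early once the answer distribution is confident. This is a minimal, hypothetical illustration, not the authors' Eclipse architecture; the feature extractors, glimpse policy, classifier weights, and stopping rule are placeholder assumptions.

```python
# Hypothetical sketch of glimpse-based dynamic inference for multiple-choice video QA.
# NOT the Eclipse implementation from the paper; all components are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((96, 4))  # placeholder fused classifier weights (64 visual + 32 text dims, 4 answers)

def cheap_glimpse(frame):
    # Lightweight per-frame cue (stand-in for a pooled low-resolution feature).
    return frame.mean(axis=(0, 1))

def expensive_features(frame):
    # Stand-in for a costly backbone applied only to selected frames.
    return frame.reshape(-1)[:64]

def answer_logits(state, question_emb):
    # Placeholder fusion of the accumulated visual state with the question embedding.
    return np.concatenate([state, question_emb]) @ W

def dynamic_infer(frames, question_emb, budget=8, conf_thresh=0.9):
    """Repeatedly pick the most 'salient' unseen frame, update the answer
    distribution, and stop early once the prediction is confident enough."""
    saliency = np.array([cheap_glimpse(f).sum() for f in frames])
    unseen = set(range(len(frames)))
    state = np.zeros(64)
    probs = np.full(4, 0.25)
    used = 0
    while unseen and used < budget and probs.max() < conf_thresh:
        idx = max(unseen, key=lambda i: saliency[i])   # glimpse policy: highest-scoring unseen frame
        unseen.remove(idx)
        state += expensive_features(frames[idx])       # pay the full feature cost only here
        logits = answer_logits(state, question_emb)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        used += 1
    return int(probs.argmax()), used

# Toy usage: 30 random "frames" and a random question embedding.
frames = [rng.random((36, 64, 3)) for _ in range(30)]
question = rng.standard_normal(32)
pred, frames_used = dynamic_infer(frames, question)
print(f"predicted answer {pred} using {frames_used} of {len(frames)} frames")
```

The key design point this sketch tries to convey is that per-frame compute is spent adaptively: easy videos can terminate after a few glimpses, while harder ones consume more of the frame budget.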
