CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

Computer vision has undergone a dramatic revolution in performance, driven in large part through deep features trained on large-scale supervised datasets. However, much of these improvements have focused on static image analysis; video understanding has seen rather modest improvements. Even though new datasets and spatiotemporal models have been proposed, simple frame-by-frame classification methods often still remain competitive. We posit that current video datasets are plagued with implicit biases over scene and object structure that can dwarf variations in temporal structure. In this work, we build a video dataset with fully observable and controllable object and scene bias, and which truly requires spatiotemporal understanding in order to be solved. Our dataset, named CATER, is rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that require long-term reasoning. In addition to being a challenging dataset, CATER also provides a plethora of diagnostic tools to analyze modern spatiotemporal video architectures by being completely observable and controllable. Using CATER, we provide insights into some of the most recent state of the art deep video architectures.

[1]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[2]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Catherine Achard,et al.  The DAily Home LIfe Activity Dataset: A High Semantic Activity Dataset for Online Recognition , 2017, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[4]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[5]  Carl Doersch,et al.  Learning Visual Question Answering by Bootstrapping Hard Attention , 2018, ECCV.

[6]  Massimo Massa,et al.  A New Method for Real Time Abandoned Object Detection and Owner Tracking , 2006, 2006 International Conference on Image Processing.

[7]  Larry S. Davis,et al.  VidMAP: video monitoring of activity with Prolog , 2005, IEEE Conference on Advanced Video and Signal Based Surveillance, 2005..

[8]  Anoop Cherian,et al.  Human Pose Forecasting via Deep Markov Models , 2017, 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[9]  Yuandong Tian,et al.  Building Generalizable Agents with a Realistic and Rich 3D Environment , 2018, ICLR.

[10]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[11]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[12]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[13]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[16]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[17]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Bernt Schiele,et al.  Grounding Action Descriptions in Videos , 2013, TACL.

[19]  Ross B. Girshick,et al.  PHYRE: A New Benchmark for Physical Reasoning , 2019, NeurIPS.

[20]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[21]  Larry S. Davis,et al.  Learning to Detect Carried Objects with Minimal Supervision , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[22]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[23]  Javier Sánchez Pérez,et al.  TV-L1 Optical Flow Estimation , 2013, Image Process. Line.

[24]  C. Lawrence Zitnick,et al.  Adopting Abstract Images for Semantic Scene Understanding , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[26]  Wei Wu,et al.  Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[27]  Chuang Gan,et al.  CLEVRER: CoLlision Events for Video REpresentation and Reasoning , 2020, ICLR.

[28]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Emmanuel Dupoux,et al.  IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning , 2018, ArXiv.

[31]  Abhinav Gupta,et al.  What Actions are Needed for Understanding Human Actions in Videos? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Antonio Manuel López Peña,et al.  Procedural Generation of Videos to Train Deep Action Recognition Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yutaka Satoh,et al.  Human Action Recognition Without Human , 2016, ECCV Workshops.

[34]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Jitendra Malik,et al.  From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[37]  Michael J. Black,et al.  A Naturalistic Open Source Movie for Optical Flow Evaluation , 2012, ECCV.

[38]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xiao Liu,et al.  Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification , 2017, ArXiv.

[42]  Ehud Rivlin,et al.  Understanding Video Events: A Survey of Methods for Automatic Interpretation of Semantic Occurrences in Video , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[43]  Yi Li,et al.  RESOUND: Towards Action Recognition Without Representation Bias , 2018, ECCV.

[44]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  A F Bobick,et al.  Movement, activity and action: the role of knowledge in the perception of motion. , 1997, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[46]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Xiao Liu,et al.  Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  James F. Allen Maintaining knowledge about temporal intervals , 1983, CACM.

[51]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning For Video Understanding , 2017, ArXiv.

[52]  Deva Ramanan,et al.  First-person pose recognition using egocentric workspaces , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Rogério Schmidt Feris,et al.  Robust Detection of Abandoned and Removed Objects in Complex Surveillance Videos , 2011, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[54]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Ashish Kapoor,et al.  AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles , 2017, FSR.

[57]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[58]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[59]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[60]  Limin Wang,et al.  MoFAP: A Multi-level Representation for Action Recognition , 2015, International Journal of Computer Vision.

[61]  Ramakant Nevatia,et al.  Video-based event recognition: activity representation and probabilistic recognition methods , 2004, Comput. Vis. Image Underst..

[62]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Trevor Darrell,et al.  Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65]  Yahong Han,et al.  Explore Multi-Step Reasoning in Video Question Answering , 2018, CoVieW@MM.

[66]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[68]  Ali Farhadi,et al.  AI2-THOR: An Interactive 3D Environment for Visual AI , 2017, ArXiv.

[69]  Bernt Schiele,et al.  Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.

[70]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[71]  Christian Wolf,et al.  COPHY: Counterfactual Learning of Physical Dynamics , 2020, ICLR.

[72]  Viorica Patraucean,et al.  Sideways: Depth-Parallel Training of Video Models , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Yale Song,et al.  TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[75]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[76]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[77]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[78]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[79]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[80]  Y. Shoham Reasoning About Change: Time and Causation from the Standpoint of Artificial Intelligence , 1987 .

[81]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[82]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Wojciech Jaskowski,et al.  ViZDoom: A Doom-based AI research platform for visual reinforcement learning , 2016, 2016 IEEE Conference on Computational Intelligence and Games (CIG).

[84]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[85]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.