Paper information - Oops! Predicting Unintentional Action in Video

Oops! Predicting Unintentional Action in Video

From just a short glance at a video, we can often tell whether a person's action is intentional or not. Can we train a model to recognize this? We introduce a dataset of in-the-wild videos of unintentional action, as well as a suite of tasks for recognizing, localizing, and anticipating its onset. We train a supervised neural network as a baseline and analyze its performance compared to human consistency on the tasks. We also investigate self-supervised representations that leverage natural signals in our dataset, and show the effectiveness of an approach that uses the intrinsic speed of video to perform competitively with highly supervised pretraining. However, a significant gap between machine and human performance remains.
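
The abstract does not spell out how the intrinsic speed of video is turned into a training signal. One common way to set up such a pretext task is to subsample frames at a randomly chosen stride and ask a model to classify which stride was used. The sketch below illustrates only that sampling step, under stated assumptions: the speed factors, the clip length, and the function name `sample_speed_clip` are hypothetical choices for illustration and are not taken from the paper.

```python
import random

# Hypothetical playback-speed factors; the class label is the index into this tuple.
SPEEDS = (1, 2, 4, 8)


def sample_speed_clip(num_frames, clip_len=16, speeds=SPEEDS, rng=random):
    """Return (frame_indices, label) for one speed-prediction training example.

    frame_indices selects clip_len frames spaced `speeds[label]` apart, so a
    larger stride makes the clip appear to play faster.
    """
    # A clip sampled at stride s covers (clip_len - 1) * s + 1 source frames,
    # so only strides that fit inside the video are eligible.
    feasible = [
        i for i, s in enumerate(speeds)
        if (clip_len - 1) * s + 1 <= num_frames
    ]
    if not feasible:
        raise ValueError("video is too short for any of the given speeds")

    label = rng.choice(feasible)
    stride = speeds[label]
    span = (clip_len - 1) * stride + 1

    # Choose a random start so the whole clip stays within the video.
    start = rng.randrange(num_frames - span + 1)
    indices = [start + k * stride for k in range(clip_len)]
    return indices, label


if __name__ == "__main__":
    rng = random.Random(0)
    indices, label = sample_speed_clip(num_frames=200, rng=rng)
    print("stride:", SPEEDS[label], "first frames:", indices[:4])
```

In a setup like this, the selected frames would be fed to a video network trained with a cross-entropy loss over the speed labels, and the learned features could then be fine-tuned on the downstream tasks. No manual annotation is needed, because the label comes from the sampling procedure itself.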
