Learning Goals from Failure

We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show our trained model is able to predict the underlying goals in video of unintentional action. We also propose a method to "automatically correct" unintentional action by leveraging gradient signals of our model to adjust latent trajectories. Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large (supervised) datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video. Project page: https://aha.cs.columbia.edu/
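The "automatic correction" idea above can be sketched in a toy form: treat a video as a latent trajectory and take gradient-ascent steps on a learned goal-success score. The scorer below is a hypothetical stand-in (a quadratic peaked at a "successful" trajectory `g`), not the paper's trained model; `success_score`, `correct_trajectory`, and the learning rate are illustrative assumptions.

```python
import numpy as np

def success_score(z, g):
    """Toy differentiable score: higher when trajectory z matches the goal g.

    Stand-in for the paper's learned model of goal success.
    """
    return -np.sum((z - g) ** 2)

def score_grad(z, g):
    """Analytic gradient of the toy score with respect to z."""
    return -2.0 * (z - g)

def correct_trajectory(z, g, lr=0.1, steps=100):
    """Gradient ascent on the score, nudging the latent trajectory uphill."""
    z = z.copy()
    for _ in range(steps):
        z += lr * score_grad(z, g)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=(8, 4))      # latent trajectory of a successful action
    z = g + rng.normal(size=(8, 4))  # perturbed, "unintentional" trajectory
    z_fixed = correct_trajectory(z, g)
    print(success_score(z_fixed, g) > success_score(z, g))
```

In the paper the score comes from a model trained on unintentional-action video, and the updated trajectory is decoded back to a corrected action; the sketch only shows the gradient-adjustment step.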
