Learning Goals from Failure

We introduce a framework that predicts the goals behind observable human action in video. Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show our trained model is able to predict the underlying goals in video of unintentional action. We also propose a method to "automatically correct" unintentional action by leveraging gradient signals of our model to adjust latent trajectories. Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large (supervised) datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video.

[1]  John Langford,et al.  CAPTCHA: Using Hard AI Problems for Security , 2003, EUROCRYPT.

[2]  Li Fei-Fei,et al.  HiDDeN: Hiding Data With Deep Networks , 2018, ECCV.

[3]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[4]  Andrew Zisserman,et al.  Self-supervised Co-training for Video Representation Learning , 2020, NeurIPS.

[5]  James Bailey,et al.  Black-box Adversarial Attacks on Video Recognition Models , 2019, ACM Multimedia.

[6]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7]  Alexander J. Smola,et al.  Sampling Matters in Deep Embedding Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Phillip Isola,et al.  On the "steerability" of generative adversarial networks , 2019, ICLR.

[9]  Julio Hernandez-Castro,et al.  No Bot Expects the DeepCAPTCHA! Introducing Immutable Adversarial Examples, With Applications to CAPTCHA Generation , 2017, IEEE Transactions on Information Forensics and Security.

[10]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  A. Woodward Infants' Grasp of Others' Intentions , 2009, Current directions in psychological science.

[12]  David Picard,et al.  2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Juan Carlos Niebles,et al.  Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Brendan Tran Morris,et al.  Learning to Score Olympic Events , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Alexei A. Efros,et al.  Scene completion using millions of photographs , 2007, SIGGRAPH 2007.

[16]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[17]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[18]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Samy Bengio,et al.  Adversarial examples in the physical world , 2016, ICLR.

[20]  Pieter Abbeel,et al.  Learning Plannable Representations with Causal InfoGAN , 2018, NeurIPS.

[21]  Andrew M. Dai,et al.  Adversarial Training Methods for Semi-Supervised Text Classification , 2016, ICLR.

[22]  Kate Saenko,et al.  R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Boyuan Chen,et al.  Oops! Predicting Unintentional Action in Video , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[25]  Xiaofeng Wang,et al.  Invisible Mask: Practical Attacks on Face Recognition with Infrared , 2018, ArXiv.

[26]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[27]  Juan Carlos Niebles,et al.  Procedure Planning in Instructional Videos , 2019, ECCV.

[28]  Andrew Zisserman,et al.  Learning and Using the Arrow of Time , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  J. J. Guajardo,et al.  How infants make sense of intentional action , 2001 .

[30]  Shin Ishii,et al.  Distributional Smoothing with Virtual Adversarial Training , 2015, ICLR 2016.

[31]  Andrew Zisserman,et al.  Memory-augmented Dense Predictive Coding for Video Representation Learning , 2020, ECCV.

[32]  Alison Gopnik,et al.  Toddlers' understanding of intentions, desires and emotions: Explorations of the dark ages. , 1999 .

[33]  Andrew Zisserman,et al.  A Short Note about Kinetics-600 , 2018, ArXiv.

[34]  A. Meltzoff Understanding the Intentions of Others: Re-Enactment of Intended Acts by 18-Month-Old Children. , 1995, Developmental psychology.

[35]  Kristen Grauman,et al.  Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  King-Sun Fu,et al.  IEEE Transactions on Pattern Analysis and Machine Intelligence Publication Information , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Xiaochun Cao,et al.  Transferable Adversarial Attacks for Image and Video Object Detection , 2018, IJCAI.

[38]  Intentions L. Woodward Infants' Grasp of Others' , 2009 .

[39]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Thomas R. Shultz,et al.  Development of the ability to distinguish intended actions from mistakes, reflexes, and passive movements , 1980 .

[41]  Junfeng Yang,et al.  NEUZZ: Efficient Fuzzing with Neural Program Smoothing , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[42]  M. Tomasello,et al.  Understanding and sharing intentions: The origins of cultural cognition , 2005, Behavioral and Brain Sciences.

[43]  Davide Mate,et al.  Children's understanding of intentional vs. non-intentional action , 2007 .

[44]  Dare A. Baldwin,et al.  Intentions and Intentionality: Foundations of Social Cognition , 2001 .

[45]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[47]  Philip S. Yu,et al.  A Comprehensive Survey on Graph Neural Networks , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[48]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[49]  C. Moore,et al.  Intentional relations and social understanding , 1996, Behavioral and Brain Sciences.

[50]  Antonio Torralba,et al.  Assessing the Quality of Actions , 2014, ECCV.

[51]  Cordelia Schmid,et al.  Contrastive Bidirectional Transformer for Temporal Representation Learning , 2019, ArXiv.

[52]  Andrew Zisserman,et al.  Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[53]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Dima Damen,et al.  Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55]  M. Tomasello The usage-based theory of language acquisition , 2009 .

[56]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[57]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Deva Ramanan,et al.  Parsing Videos of Actions with Segmental Grammars , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Aleksander Madry,et al.  Adversarial Examples Are Not Bugs, They Are Features , 2019, NeurIPS.

[60]  Andrew Zisserman,et al.  The AVA-Kinetics Localized Human Actions Video Dataset , 2020, ArXiv.

[61]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[62]  George Danezis,et al.  Generating steganographic images via adversarial training , 2017, NIPS.

[63]  Sarah A. Gerson,et al.  The emergence of intention attribution in infancy. , 2009, The psychology of learning and motivation.

[64]  知秀 柴田 5分で分かる!? 有名論文ナナメ読み:Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .

[65]  Logan Engstrom,et al.  Synthesizing Robust Adversarial Examples , 2017, ICML.

[66]  Tae-Hyun Oh,et al.  Listen to Look: Action Recognition by Previewing Audio , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).