Self-Supervised Keypoint Discovery in Behavioral Videos

We propose a method for learning the posture and structure of agents from unlabelled behavioral videos. Starting from the observation that behaving agents are generally the main sources of movement in behavioral videos, our method uses an encoder-decoder architecture with a geometric bottleneck to reconstruct the difference between video frames. By focusing only on regions of movement, our approach works directly on input videos without requiring manual annotations, such as keypoints or bounding boxes. Experiments on a variety of agent types (mouse, fly, human, jellyfish, and trees) demonstrate the generality of our approach and reveal that our discovered keypoints represent semantically meaningful body parts, which achieve state-of-the-art performance on keypoint regression among self-supervised methods. Additionally, our discovered keypoints achieve performance comparable to supervised keypoints on downstream tasks, such as behavior classification, suggesting that our method can dramatically reduce the cost of model training relative to supervised methods.
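
To make the idea of a geometric bottleneck and a frame-difference target more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the reconstruction target is an absolute pixel difference, that keypoints are read from encoder score maps with a spatial softmax, and that the decoder sees only Gaussian heatmaps centred on those keypoints. These are common choices in keypoint-discovery work, but the abstract does not specify them. The function names, the `sigma` parameter, and the array shapes are all hypothetical.

```python
import numpy as np


def frame_difference(frame_t, frame_later):
    """Absolute per-pixel difference between two frames (assumed target)."""
    # Cast to float first so unsigned integer frames do not wrap around.
    return np.abs(frame_later.astype(np.float64) - frame_t.astype(np.float64))


def spatial_softmax_keypoints(score_maps):
    """Turn (K, H, W) score maps into (K, 2) expected (x, y) coordinates."""
    k, h, w = score_maps.shape
    flat = score_maps.reshape(k, -1)
    # Subtract the per-map maximum for numerical stability.
    flat = flat - flat.max(axis=1, keepdims=True)
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    x = (probs * xs).sum(axis=(1, 2))
    y = (probs * ys).sum(axis=(1, 2))
    return np.stack([x, y], axis=1)


def gaussian_heatmaps(keypoints, height, width, sigma=1.5):
    """Render (K, 2) keypoints as (K, H, W) isotropic Gaussian heatmaps."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - keypoints[:, 0, None, None]
    dy = ys[None] - keypoints[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Toy score map with a single strong response at x=3, y=5.
    maps = np.zeros((1, 8, 8))
    maps[0, 5, 3] = 50.0
    keypoints = spatial_softmax_keypoints(maps)
    heatmaps = gaussian_heatmaps(keypoints, 8, 8)
    print(keypoints.shape, heatmaps.shape)
```

In this sketch the decoder would receive only the heatmaps (plus appearance features from a reference frame), so any information about where movement occurred has to pass through the K keypoint coordinates.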
