Self-Supervised Keypoint Discovery in Behavioral Videos

We propose a method for learning the posture and structure of agents from unlabelled behavioral videos. Starting from the observation that behaving agents are generally the main sources of movement in behavioral videos, our method, Behavioral Keypoint Discovery (B-KinD), uses an encoder-decoder architecture with a geometric bottleneck to reconstruct the spatiotemporal difference between video frames. By focusing only on regions of movement, our approach works directly on input videos without requiring manual annotations. Experiments on a variety of agent types (mouse, fly, human, jellyfish, and trees) demonstrate the generality of our approach and reveal that our discovered keypoints represent semantically meaningful body parts, achieving state-of-the-art performance on keypoint regression among self-supervised methods. Additionally, B-KinD achieves performance comparable to supervised keypoints on downstream tasks such as behavior classification, suggesting that our method can dramatically reduce model training costs compared with supervised methods.
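
The idea of a geometric bottleneck driven by frame differences can be illustrated with a toy sketch. This is not the B-KinD implementation: the abstract only states that an encoder-decoder with a geometric bottleneck reconstructs the spatiotemporal difference between frames. The absolute-difference target, the soft-argmax keypoint extraction, the Gaussian rendering, and all function names below are assumptions chosen for illustration, and a scaled difference image stands in for a learned encoder.

```python
import math


def frame_difference(frame_a, frame_b):
    """Per-pixel absolute difference between two equally sized grayscale frames.

    Assumption: a plain absolute difference stands in for the reconstruction target.
    """
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def soft_argmax(heatmap):
    """Spatial softmax over a heatmap, then the expected (x, y) coordinate."""
    peak = max(max(row) for row in heatmap)  # subtract the max for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y


def gaussian_map(x, y, height, width, sigma=1.0):
    """Render one keypoint as a 2D Gaussian; a decoder would only see maps like this."""
    return [[math.exp(-((j - x) ** 2 + (i - y) ** 2) / (2 * sigma ** 2))
             for j in range(width)]
            for i in range(height)]


# Toy video: a single bright pixel appears at column 3, row 1 in the second frame.
frame_t = [[0.0] * 5 for _ in range(5)]
frame_next = [row[:] for row in frame_t]
frame_next[1][3] = 1.0

target = frame_difference(frame_t, frame_next)

# Hypothetical stand-in for an encoder output: the difference image, sharply scaled.
heatmap = [[20.0 * v for v in row] for row in target]

kx, ky = soft_argmax(heatmap)
bottleneck = gaussian_map(kx, ky, height=5, width=5)
print(f"keypoint at x={kx:.2f}, y={ky:.2f}")
```

In this sketch the only information passed forward is the keypoint coordinate, re-rendered as a Gaussian map, which is what makes the bottleneck geometric: any model reconstructing the motion target from it has to place keypoints where the movement is.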
