Paper Information - Unsupervised Discovery of Parts, Structure, and Dynamics

Unsupervised Discovery of Parts, Structure, and Dynamics

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
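To make the three components concrete, here is a minimal toy sketch of how part masks, a hierarchy, and per-part motion could combine into a single motion field. It is not the authors' implementation. The function names, the binary `ancestors` matrix, and the rule that a part's global motion is its own local motion plus the local motions of its ancestors are all assumptions made for illustration; the abstract says only that a structural descriptor composes low-level concepts into a hierarchy.

```python
from typing import List

Vec = List[float]
Mask = List[List[float]]


def compose_global_motion(local: List[Vec], ancestors: List[List[int]]) -> List[Vec]:
    """Assumed rule: global motion of part i is its local motion plus the
    local motions of every part j flagged as an ancestor (ancestors[i][j] == 1)."""
    out = []
    for i, (dx, dy) in enumerate(local):
        for j, is_ancestor in enumerate(ancestors[i]):
            if is_ancestor:
                dx += local[j][0]
                dy += local[j][1]
        out.append([dx, dy])
    return out


def layered_flow(masks: List[Mask], motion: List[Vec]) -> List[List[Vec]]:
    """Layered representation: per-pixel flow is the mask-weighted sum of part motions."""
    h, w = len(masks[0]), len(masks[0][0])
    flow = [[[0.0, 0.0] for _ in range(w)] for _ in range(h)]
    for mask, (dx, dy) in zip(masks, motion):
        for y in range(h):
            for x in range(w):
                flow[y][x][0] += mask[y][x] * dx
                flow[y][x][1] += mask[y][x] * dy
    return flow


# Toy scene (hypothetical): part 0 is a torso, part 1 is an arm attached to it.
local_motion = [[1.0, 0.0], [0.0, 2.0]]   # each part's motion relative to its parent
ancestors = [[0, 0], [1, 0]]              # the arm (row 1) has the torso (column 0) as ancestor
masks = [[[1, 0]], [[0, 1]]]              # 1x2 image: left pixel is torso, right pixel is arm

global_motion = compose_global_motion(local_motion, ancestors)
flow = layered_flow(masks, global_motion)
```

In this toy scene the arm pixel inherits the torso's horizontal motion on top of its own vertical motion, which is the kind of coupling a hierarchical part representation is meant to capture.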
