Bridging the Gap to Real-World Object-Centric Learning

Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth to discover objects successfully. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing image-based object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.
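
The training signal described above, reconstructing features rather than pixels, can be illustrated with a minimal sketch. It is not the DINOSAUR implementation: the abstract does not specify the decoder or the loss, so the sketch assumes a mixture-style decoder in which each slot proposes a reconstruction for every patch, a softmax over slots weights the proposals, and the loss is the mean squared error against features from a frozen self-supervised encoder. The names `softmax`, `feature_reconstruction_loss`, `targets`, `slot_recons`, and `slot_logits` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def feature_reconstruction_loss(targets, slot_recons, slot_logits):
    """Mean squared error between target features and a slot-wise mixture.

    Assumed shapes (all hypothetical, chosen for illustration):
      targets:     N patches x D feature dims, from a frozen encoder
      slot_recons: K slots x N patches x D dims, per-slot reconstructions
      slot_logits: K slots x N patches, unnormalised slot-to-patch scores
    Returns the loss and the K x N soft masks (one weight per slot and patch).
    """
    num_slots = len(slot_recons)
    num_patches = len(targets)
    dim = len(targets[0])
    masks = [[0.0] * num_patches for _ in range(num_slots)]
    squared_error = 0.0
    for n in range(num_patches):
        # Slots compete for each patch: weights over slots sum to 1.
        weights = softmax([slot_logits[k][n] for k in range(num_slots)])
        for k in range(num_slots):
            masks[k][n] = weights[k]
        for d in range(dim):
            combined = sum(
                weights[k] * slot_recons[k][n][d] for k in range(num_slots)
            )
            squared_error += (combined - targets[n][d]) ** 2
    return squared_error / (num_patches * dim), masks


# Toy usage: two patches, two feature dims, two slots.
targets = [[1.0, 2.0], [3.0, 4.0]]
slot_recons = [
    [[1.0, 2.0], [0.0, 0.0]],  # slot 0 explains patch 0
    [[0.0, 0.0], [3.0, 4.0]],  # slot 1 explains patch 1
]
slot_logits = [[10.0, -10.0], [-10.0, 10.0]]
loss, masks = feature_reconstruction_loss(targets, slot_recons, slot_logits)
```

In this toy setup each slot claims one patch, so the loss is close to zero, and the soft masks are what would be read out as an unsupervised segmentation under the stated assumptions.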
