Unsupervised object learning from dense equivariant image labelling

One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations. Starting from the recent idea of viewpoint factorization, we propose a new approach that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame. This coordinate frame is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates. We demonstrate the applicability of this method to simple articulated objects and deformable objects such as human faces, learning embeddings from random synthetic transformations or optical flow correspondences, all without any manual supervision.

[1]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2]  David W. Jacobs,et al.  WarpNet: Weakly Supervised Matching for Single-View Reconstruction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[4]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[5]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Timothy F. Cootes,et al.  Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[7]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[8]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[9]  H. Bourlard,et al.  Auto-association by multilayer perceptrons and singular value decomposition , 1988, Biological Cybernetics.

[10]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[12]  Feng Zhou,et al.  Deep Deformation Network for Object Landmark Localization , 2016, ECCV.

[13]  Erik G. Learned-Miller,et al.  Data driven image models through continuous joint alignment , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Hossein Mobahi,et al.  A Compositional Model for Low-Dimensional Image Set Representation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Trevor Darrell,et al.  Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[17]  Ira Kemelmacher-Shlizerman,et al.  Collection flow , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yong Jae Lee,et al.  FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[21]  Trevor Darrell,et al.  Adversarial Feature Learning , 2016, ICLR.

[22]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[23]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Andrea Vedaldi,et al.  Unsupervised Learning of Object Landmarks by Factorized Spatial Embeddings , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Andrea Vedaldi,et al.  Fully-trainable deep matching , 2016, BMVC.

[27]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[28]  Alexei A. Efros,et al.  Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[30]  Xiaoou Tang,et al.  Learning Deep Representation for Face Alignment with Auxiliary Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[33]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).