Unsupervised Learning of Object Landmarks by Factorized Spatial Embeddings

Learning automatically the structure of object categories remains an important open problem in computer vision. In this paper, we propose a novel unsupervised approach that can discover and learn landmarks in object categories, thus characterizing their structure. Our approach is based on factorizing image deformations, as induced by a viewpoint change or an object deformation, by learning a deep neural network that detects landmarks consistently with such visual effects. Furthermore, we show that the learned landmarks establish meaningful correspondences between different object instances in a category without having to impose this requirement explicitly. We assess the method qualitatively on a variety of object types, natural and man-made. We also show that our unsupervised landmarks are highly predictive of manually-annotated landmarks in face benchmark datasets, and can be used to regress these with a high degree of accuracy.

[1]  Pietro Perona,et al.  A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry , 1998, ECCV.

[2]  Stefanos Zafeiriou,et al.  300 Faces In-The-Wild Challenge: database and results , 2016, Image Vis. Comput..

[3]  Andrea Vedaldi,et al.  AnchorNet: A Weakly Supervised Network to Learn Geometry-Sensitive Features for Semantic Matching , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yong Jae Lee,et al.  FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Donghoon Lee,et al.  Face alignment using cascade Gaussian process regression trees , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Josephine Sullivan,et al.  One millisecond face alignment with an ensemble of regression trees , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Andrea Vedaldi,et al.  Learning Covariant Feature Detectors , 2016, ECCV Workshops.

[9]  Xiaogang Wang,et al.  Deep Convolutional Network Cascade for Facial Point Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Luc Van Gool,et al.  Real-time facial feature detection using conditional regression forests , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Stefanos Zafeiriou,et al.  Robust Discriminative Response Map Fitting with Constrained Local Models , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[13]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Xiaoou Tang,et al.  Learning Deep Representation for Face Alignment with Auxiliary Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[17]  Hossein Mobahi,et al.  A Compositional Model for Low-Dimensional Image Set Representation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Andrea Vedaldi,et al.  Fully-trainable deep matching , 2016, BMVC.

[19]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[20]  Alexei A. Efros,et al.  Learning Dense Correspondence via 3D-Guided Cycle Consistency , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[22]  Shiguang Shan,et al.  Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment , 2014, ECCV.

[23]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[24]  Honglak Lee,et al.  Learning to Align from Scratch , 2012, NIPS.

[25]  Gang Hua,et al.  Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[27]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  H. Bourlard,et al.  Auto-association by multilayer perceptrons and singular value decomposition , 1988, Biological Cybernetics.

[30]  Michel F. Valstar,et al.  Guided Unsupervised Learning of Mode Specific Models for Facial Point Detection in the Wild , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[31]  Pietro Perona,et al.  Towards automatic discovery of object categories , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[32]  Jian Sun,et al.  Face Alignment at 3000 FPS via Regressing Local Binary Features , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[34]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Feng Zhou,et al.  Deep Deformation Network for Object Landmark Localization , 2016, ECCV.

[36]  Timothy F. Cootes,et al.  Active Appearance Models , 1998, ECCV.

[37]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[38]  Yang Yang,et al.  Stacked Deformable Part Model with Shape Regression for Object Part Localization , 2014, ECCV.

[39]  Ira Kemelmacher-Shlizerman,et al.  Collection flow , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Simon Baker,et al.  Active Appearance Models Revisited , 2004, International Journal of Computer Vision.

[41]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  David J. Kriegman,et al.  Localizing Parts of Faces Using a Consensus of Exemplars , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[45]  Hanjiang Lai,et al.  Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks , 2016, ECCV.

[46]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[47]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[48]  Fred L. Bookstein,et al.  Principal Warps: Thin-Plate Splines and the Decomposition of Deformations , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[49]  Peter N. Belhumeur,et al.  Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Matti Pietikäinen,et al.  Unsupervised learning of overcomplete face descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[51]  Horst Bischof,et al.  Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[52]  David Cristinacce,et al.  Automatic feature localisation with constrained local models , 2008, Pattern Recognit..

[53]  David J. Kriegman,et al.  Acquiring linear subspaces for face recognition under variable lighting , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Weiwei Zhang,et al.  Cat Head Detection - How to Effectively Exploit Shape and Texture Features , 2008, ECCV.

[55]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[56]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Luc Van Gool,et al.  Using a Deformation Field Model for Localizing Faces and Facial Points under Weak Supervision , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[59]  Saurabh Singh,et al.  Part Localization using Multi-Proposal Consensus for Fine-Grained Categorization , 2015, BMVC.

[60]  Yuning Jiang,et al.  Extensive Facial Landmark Localization with Coarse-to-Fine Convolutional Network Cascade , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[61]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[62]  Pietro Perona,et al.  Cascaded pose regression , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[63]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[64]  Gregory Shakhnarovich,et al.  Learning Representations for Automatic Colorization , 2016, ECCV.

[65]  Luc Van Gool,et al.  Unsupervised face alignment by robust nonrigid mapping , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[66]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[67]  David W. Jacobs,et al.  WarpNet: Weakly Supervised Matching for Single-View Reconstruction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Pietro Perona,et al.  Robust Face Landmark Estimation under Occlusion , 2013, 2013 IEEE International Conference on Computer Vision.

[70]  M. K. Fleming,et al.  Categorization of faces using unsupervised feature extraction , 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[71]  Cheng Li,et al.  Face alignment by coarse-to-fine shape searching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Kristen Grauman,et al.  Fine-Grained Visual Comparisons with Local Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[73]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[74]  Trevor Darrell,et al.  Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[76]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[77]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[78]  Jian Sun,et al.  Face Alignment by Explicit Shape Regression , 2012, International Journal of Computer Vision.

[79]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[80]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[81]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[82]  David J. Kriegman,et al.  From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[83]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Maja Pantic,et al.  Facial point detection using boosted regression and graph models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.