Deep representation learning for keypoint localization

Keypoint localization aims to locate points of interest in an input image. It has become an important tool for many computer vision tasks, such as fine-grained visual categorization, object detection, and pose estimation, and tremendous effort has therefore been devoted to improving its performance. However, most proposed methods supervise keypoint detectors using a confidence map generated from ground-truth keypoint locations. Furthermore, the maximum achievable localization accuracy differs from keypoint to keypoint, because it is determined by the underlying keypoint structure. A keypoint detector therefore often fails to detect ambiguous keypoints when trained with strict supervision, that is, when only a small localization error is permitted. Training with looser supervision can help detect the ambiguous keypoints, but at a cost to localization accuracy for keypoints with distinctive appearances. In this thesis, we propose hierarchically supervised nets (HSNs), a method that imposes hierarchical supervision within deep convolutional neural networks (CNNs) for keypoint localization. To achieve this, we first propose a fully convolutional Inception network with several branches of varying depths to obtain hierarchical feature representations. We then build a coarse part detector on top of each branch of features and a fine part detector that takes features from all the branches as input.

Collecting image data with keypoint annotations is harder than collecting image data with image-level labels. One can build a classification dataset by collecting images from Flickr or Google Images through keyword search and then refining the results, whereas keypoint annotation requires a human to click the rough location of each keypoint in every image. To address the problem of insufficient part annotations, we propose a part detection framework that combines deep representation learning and domain adaptation within the same training process. We adopt one of the coarse detectors from HSNs as the baseline and perform a quantitative evaluation on the CUB-200-2011 and Birdsnap datasets. Interestingly, our method trained on images of only 10 species achieves 61.4% PCK accuracy on a testing set of 190 unseen species.

Finally, we explore the application of keypoint localization to fine-grained visual categorization. We propose a new part-based model that consists of a localization module to detect object parts (the "where" pathway) and a classification module to classify fine-grained categories at the subordinate level (the "what" pathway). Experimental results show that our method with keypoint localization achieves state-of-the-art performance on the Caltech-UCSD Birds-200-2011 dataset.
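The PCK figure reported above is not defined in this abstract. Under a commonly used definition (percentage of correct keypoints), a predicted keypoint counts as correct when it lies within a fraction `alpha` of a reference size, such as the longer side of the object bounding box, of the ground-truth location. The following is a minimal sketch under that assumption; the function name `pck`, the parameters `ref_size` and `alpha`, and the toy coordinates are all hypothetical and are not taken from the thesis. It also illustrates how a stricter tolerance (smaller `alpha`) lowers the score for the same predictions.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def pck(predictions: Sequence[Point],
        ground_truth: Sequence[Optional[Point]],
        ref_size: float,
        alpha: float = 0.1) -> float:
    """Fraction of annotated keypoints predicted within alpha * ref_size.

    ref_size is an assumed normalizer (e.g. the longer bounding-box side);
    ground-truth entries set to None (keypoint not annotated) are skipped.
    """
    threshold = alpha * ref_size
    correct = 0
    total = 0
    for pred, gt in zip(predictions, ground_truth):
        if gt is None:
            continue  # no annotation for this keypoint, so it is not scored
        total += 1
        distance = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
        if distance <= threshold:
            correct += 1
    return correct / total if total else 0.0


# Toy example: four keypoints, one of which has no annotation.
truth = [(10.0, 10.0), (50.0, 50.0), None, (80.0, 20.0)]
preds = [(12.0, 11.0), (70.0, 50.0), (0.0, 0.0), (80.0, 27.0)]

loose = pck(preds, truth, ref_size=100.0, alpha=0.1)    # 10-pixel tolerance
strict = pck(preds, truth, ref_size=100.0, alpha=0.05)  # 5-pixel tolerance
print(f"PCK@0.1 = {loose:.3f}, PCK@0.05 = {strict:.3f}")
```

With the 10-pixel tolerance, two of the three annotated keypoints are counted as correct; with the 5-pixel tolerance, only one is.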
