Integral Knowledge Distillation for Multi-Person Pose Estimation

Both accuracy and efficiency are of equal importance to the human pose estimation. Most of the existing methods simply pursue excellent performance, sacrificing massive computing resources and memory. Out of this consideration, we present a novel compact and lightweight framework to train more efficient estimators using knowledge distillation. Three distillation mechanisms are proposed in our method from different perspectives, including logit distillation, feature distillation and structure distillation. Concretely, the logit distillation regards the output of teacher model as soft target to stimulate the student model. The feature distillation distills the high-level features of the teacher model to assist the student. Unlike the above strategies, the structure distillation considers the problem in a global view, aiming at ensuring the student prediction contains quite abundant structure knowledge like the teacher. We empirically demonstrate the effectiveness and efficiency of our methods on two multi-person pose estimation datasets (COCO and MPII). Specifically, our model can achieve competitive performance with the most state-of-the-art methods and consume only $\text{35}\%$ model parameters and GFLOPs of our baseline (SimpleBaseline-ResNet-50) on the COCO dataset.

[1]  Liang Lin,et al.  Look into Person: Joint Body Parsing & Pose Estimation Network and a New Benchmark , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Nikos Komodakis,et al.  Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer , 2016, ICLR.

[3]  Daijin Kim,et al.  Accurate Human Pose Estimation by Aggregating Multiple Pose Hypotheses Using Modified Kernel Density Approximation , 2015, IEEE Signal Processing Letters.

[4]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[8]  Jian Dong,et al.  Towards Unified Human Parsing and Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Hwann-Tzong Chen,et al.  Self Adversarial Training for Human Pose Estimation , 2017, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[10]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[12]  David Picard,et al.  Human Pose Regression by Combining Indirect Part Detection and Contextual Information , 2017, Comput. Graph..

[13]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[14]  Shuicheng Yan,et al.  Mutual Learning to Adapt for Joint Human Parsing and Pose Estimation , 2018, ECCV.

[15]  Yoshua Bengio,et al.  FitNets: Hints for Thin Deep Nets , 2014, ICLR.

[16]  Jonathan Tompson,et al.  PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model , 2018, ECCV.

[17]  Honggang Qi,et al.  Multi-Scale Structure-Aware Network for Human Pose Estimation , 2018, ECCV.

[18]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Shohreh Kasaei,et al.  A Bayesian Framework for Sparse Representation-Based 3-D Human Pose Estimation , 2014, IEEE Signal Processing Letters.

[20]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Xiu-Shen Wei,et al.  Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[23]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[24]  Yin Li,et al.  An Optimization Based Framework for Human Pose Estimation , 2010, IEEE Signal Processing Letters.

[25]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Dongdong Yu,et al.  Multi-Person Pose Estimation With Enhanced Channel-Wise and Spatial Information , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Fernando De la Torre,et al.  Spatio-Temporal Matching for Human Pose Estimation in Video , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Mohammed Bennamoun,et al.  SkeletonNet: Mining Deep Part Features for 3-D Action Recognition , 2017, IEEE Signal Processing Letters.

[29]  Zhi Zhang,et al.  Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation , 2017, IEEE Transactions on Multimedia.

[30]  Peng Wang,et al.  Joint Multi-person Pose Estimation and Semantic Part Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Dong Liu,et al.  Human Pose Estimation Using Global and Local Normalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Mao Ye,et al.  Fast Human Pose Estimation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Ying Wu,et al.  Deeply Learned Compositional Models for Human Pose Estimation , 2018, ECCV.

[34]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.