MobiPose: real-time multi-person pose estimation on mobile devices

Human pose estimation is a key technique for many vision-based mobile applications. Yet existing multi-person pose-estimation methods fail to achieve a satisfactory user experience on commodity mobile devices such as smartphones, due to their long model-inference latency. In this paper, we propose MobiPose, a system designed to enable real-time multi-person pose estimation on mobile devices through three novel techniques. First, MobiPose takes a motion-vector-based approach to fast locate the human proposals across consecutive frames by fine-grained tracking of joints of human body, rather than running the expensive human-detection model for every frame. Second, MobiPose designs a mobile-friendly model that uses lightweight multi-stage feature extractions to significantly reduce the latency of pose estimation without compromising the model accuracy. Third, MobiPose leverages the heterogeneous computing resources of both CPU and GPU to execute the pose estimation model for multiple persons in parallel, which further reduces the total latency. We have implemented the MobiPose system on off-the-shelf commercial smartphones and conducted comprehensive experiments to evaluate the effectiveness of the proposed techniques. Evaluation results show that MobiPose achieves over 20 frames per second pose estimation with 3 persons per frame, and significantly outperforms the state-of-the-art baseline, with a speedup of up to 4.5X and 2.8X in latency on CPU and GPU, respectively, and an improvement of 5.1% in pose-estimation model accuracy. Furthermore, MobiPose achieves up to 62.5% and 37.9% energy-per-frame saving on average in comparison with the baseline on mobile CPU and GPU, respectively.

[1]  Nicu Sebe,et al.  Multimodal Human Computer Interaction: A Survey , 2005, ICCV-HCI.

[2]  Ying Wu,et al.  Deeply Learned Compositional Models for Human Pose Estimation , 2018, ECCV.

[3]  Xuanzhe Liu,et al.  A First Look at Deep Learning Apps on Smartphones , 2018, WWW.

[4]  Éric Marchand,et al.  Pose Estimation for Augmented Reality: A Hands-On Survey , 2016, IEEE Transactions on Visualization and Computer Graphics.

[5]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[7]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[8]  Wang Caihong,et al.  Design and Implementation of a Real-Time Video Stream Analysis System Based on FFMPEG , 2013, 2013 Fourth World Congress on Software Engineering.

[9]  Didier Stricker,et al.  Fully Automatic Multi-person Human Motion Capture for VR Applications , 2018, EuroVR.

[10]  Chun Chen,et al.  A survey of human pose estimation: The body parts parsing based methods , 2015, J. Vis. Commun. Image Represent..

[11]  Daehyun Kim,et al.  μLayer: Low Latency On-Device Inference Using Cooperative Single-Layer Acceleration and Processor-Friendly Quantization , 2019, EuroSys.

[12]  Robert B. McGhee,et al.  Design, Implementation, and Experimental Results of a Quaternion-Based Kalman Filter for Human Body Motion Tracking , 2005, IEEE Transactions on Robotics.

[13]  Xuanzhe Liu,et al.  DeepCache: Principled Cache for Mobile Deep Vision , 2017, MobiCom.

[14]  Xiaogang Wang,et al.  Multi-context Attention for Human Pose Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Francieli Zanon Boito,et al.  Performance/energy trade-off in scientific computing: the case of ARM big.LITTLE and Intel Sandy Bridge , 2015, IET Comput. Digit. Tech..

[18]  Rajesh Krishna Balan,et al.  DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications , 2017, MobiSys.

[19]  Alexander J. Smola,et al.  Compressed Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  J. L. Roux An Introduction to the Kalman Filter , 2003 .

[21]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[22]  Neil T. Dantam,et al.  Augmented, Mixed, and Virtual Reality Enabling of Robot Deixis , 2018, HCI.

[23]  Paramvir Bahl,et al.  Glimpse: Continuous, Real-Time Object Recognition on Mobile Devices , 2015, SenSys.

[24]  Guanghan Ning,et al.  A Top-Down Approach to Articulated Human Pose Estimation and Tracking , 2018, ECCV Workshops.

[25]  Nicu Sebe,et al.  Multimodal Human Computer Interaction: A Survey , 2005, ICCV-HCI.

[26]  Yichen Wei,et al.  Simple Baselines for Human Pose Estimation and Tracking , 2018, ECCV.

[27]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[28]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Maurice Steinman,et al.  AMD Fusion APU: Llano , 2012, IEEE Micro.

[31]  Matthew Mattina,et al.  Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[32]  Jonathan Tompson,et al.  Efficient object localization using Convolutional Networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Takanori Yokoyama,et al.  Motion Vector Based Moving Object Detection and Tracking in the MPEG Compressed Domain , 2009, 2009 Seventh International Workshop on Content-Based Multimedia Indexing.

[34]  Tobias Senst,et al.  Extending IOU Based Multi-Object Tracking by Visual Information , 2018, 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[35]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Philip S. Yu,et al.  Deep Learning towards Mobile Applications , 2018, 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS).

[37]  Quoc V. Le,et al.  Searching for MobileNetV3 , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Takashi Sato,et al.  Interpolation-Based Object Detection Using Motion Vectors for Embedded Real-time Tracking Systems , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[39]  Ke Wang,et al.  AI Benchmark: Running Deep Neural Networks on Android Smartphones , 2018, ECCV Workshops.

[40]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[41]  Nicholas D. Lane,et al.  MobiSR: Efficient On-Device Super-Resolution through Heterogeneous Mobile Processors , 2019, MobiCom.

[42]  Zhi-Li Zhang,et al.  DeepCache: A Deep Learning Based Framework For Content Caching , 2018, NetAI@SIGCOMM.

[43]  Pangfeng Liu,et al.  A collaborative CPU‐GPU approach for deep learning on mobile devices , 2019, Concurr. Comput. Pract. Exp..

[44]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  ZhuJianke,et al.  A survey of human pose estimation , 2015 .

[48]  Pangfeng Liu,et al.  CPU/GPU Collaboration Techniques for Transfer Learning on Mobile Devices , 2017, 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS).

[49]  Marco Gruteser,et al.  Edge Assisted Real-time Object Detection for Mobile Augmented Reality , 2019, MobiCom.

[50]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.