Upper Body Pose Estimation with Temporal Sequential Forests

Our objective is to estimate human upper body pose in gesture videos efficiently and accurately. To this end, we build on recent successful applications of random forest (RF) classifiers and regressors, and develop a pose estimation model with the following novelties: (i) the joints are estimated sequentially, taking account of the human kinematic chain, so we do not have to make the simplifying assumption of most previous RF methods that the joints are estimated independently; (ii) by combining both classifiers (as a mixture of experts) and regressors, we show that the learning problem is tractable and that more context can be taken into account; and (iii) dense optical flow is used to align multiple expert joint position proposals from nearby frames, thereby improving the robustness of the estimates. The resulting method is computationally efficient and can overcome a number of the errors (e.g. confusing left and right hands) made by RF pose estimators that infer joint locations independently. We show that we improve over the state of the art on upper body pose estimation for two public datasets: the BBC TV Signing dataset and the ChaLearn Gesture Recognition dataset.
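
To make points (i) and (iii) concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes a hypothetical chain order (`CHAIN`), a hypothetical `predict_offset` callable standing in for the trained forests, and plain averaging of the flow-aligned proposals; how proposals are actually combined is not specified above.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical kinematic chain for one arm (assumed ordering, for illustration only).
CHAIN: List[str] = ["head", "shoulder", "elbow", "wrist"]


def estimate_sequentially(
    chain: List[str],
    root: Point,
    predict_offset: Callable[[str, Point], Point],
) -> Dict[str, Point]:
    """Estimate joints one at a time, each conditioned on its parent joint.

    predict_offset is a stand-in for a learned predictor: given the joint
    name and the already-estimated parent position, it returns (dx, dy).
    """
    pose: Dict[str, Point] = {chain[0]: root}
    for parent, child in zip(chain, chain[1:]):
        px, py = pose[parent]
        dx, dy = predict_offset(child, pose[parent])
        pose[child] = (px + dx, py + dy)
    return pose


def warp_and_average(proposals: List[Point], flows: List[Point]) -> Point:
    """Align proposals from nearby frames to the current frame, then average.

    proposals[i] is a joint position proposed in neighbouring frame i;
    flows[i] is the optical-flow displacement (u, v) at that position,
    pointing from frame i to the current frame. Averaging is an assumption.
    """
    if not proposals or len(proposals) != len(flows):
        raise ValueError("need one flow vector per proposal, and at least one proposal")
    warped = [(x + u, y + v) for (x, y), (u, v) in zip(proposals, flows)]
    n = len(warped)
    return (sum(p[0] for p in warped) / n, sum(p[1] for p in warped) / n)
```

Because each joint is predicted relative to its parent, an error such as swapping left and right hands would require an implausible offset from the corresponding elbow, which is the intuition behind estimating along the chain rather than independently.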
