Domain Adaptation for Upper Body Pose Tracking in Signed TV Broadcasts

The objective of this work is to estimate upper body pose for signers in TV broadcasts. Given suitable training data, pose can be estimated with a random forest body joint detector, but obtaining such training data is costly. The novelty of this paper is a transfer learning method that harnesses existing training data and reuses it for new domains. Our contributions are: (i) a method for adapting existing training data to synthesise new training data for signers with different appearances, and (ii) a method for personalising training data. As a case study, we show how the appearance of the arms under different clothing, specifically short- and long-sleeved clothes, can be modelled to obtain person-specific trackers. We demonstrate that transfer learning and person-specific trackers significantly improve pose estimation performance.
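
To make the idea of synthesising training data for a new appearance concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes an annotated frame of a short-sleeved signer, a per-pixel arm mask, and a target sleeve colour; the function names, the flat-colour repainting, and the tiny list-of-tuples image format are all hypothetical simplifications. The point it illustrates is that the joint annotations carry over unchanged while the arm appearance is altered.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # RGB triple
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[bool]]        # True where a pixel belongs to an arm


def synthesise_long_sleeves(frame: Image, arm_mask: Mask, sleeve_rgb: Pixel) -> Image:
    """Return a copy of `frame` with masked arm pixels painted in `sleeve_rgb`.

    Hypothetical simplification: a real system would model sleeve texture
    and shading rather than fill a flat colour.
    """
    if len(frame) != len(arm_mask):
        raise ValueError("frame and mask must have the same height")
    out: Image = []
    for row, mask_row in zip(frame, arm_mask):
        if len(row) != len(mask_row):
            raise ValueError("frame and mask must have the same width")
        out.append([sleeve_rgb if is_arm else px for px, is_arm in zip(row, mask_row)])
    return out


def make_training_pair(
    frame: Image,
    arm_mask: Mask,
    joints: Dict[str, Tuple[int, int]],
    sleeve_rgb: Pixel,
) -> Tuple[Image, Dict[str, Tuple[int, int]]]:
    """Pair the synthesised frame with the original joint labels.

    The pose is unchanged by repainting, so the existing annotations
    can be reused for the new appearance.
    """
    return synthesise_long_sleeves(frame, arm_mask, sleeve_rgb), dict(joints)
```

Pairs produced this way could then be added to the training set of a joint detector for a signer wearing long sleeves, under the assumption that a reliable arm mask is available for the source frames.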
