Toward a 3D body part detection video dataset and hand tracking benchmark

The purpose of this paper is twofold. First, we introduce our Microsoft Kinect-based video dataset of American Sign Language (ASL) signs designed for body part detection and tracking research. Because it provides scene depth information alongside 2-dimensional (2D) color video, the dataset lets researchers go beyond 2D information in gesture recognition projects. Depth not only makes it easier to locate body parts such as the hands; it also helps distinguish two completely different gestures whose 2D trajectory projections are similar. Second, since an accurate hand locator is a critical component of any automated gesture or sign language recognition tool, this paper assesses the efficacy of a popular open-source user skeleton tracker by examining its performance on randomly selected signs from this dataset. We compare the hand positions determined by the skeleton tracker to ground-truth positions obtained from manual hand annotations of each video frame. This study thereby establishes a benchmark for assessing more advanced detection and tracking methods that exploit scene depth data. For illustration, we compare a single-hand detection method previously developed in our lab against this benchmark.
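As a concrete illustration of the evaluation protocol described above, the following minimal Python sketch computes the per-frame 3D Euclidean error between tracker output and manual annotations. The file names, the CSV layout (frame,x,y,z), and the units are hypothetical assumptions for illustration, not the dataset's actual format.

# Minimal sketch of the per-frame accuracy comparison described above.
# File names and the frame,x,y,z CSV layout are hypothetical; real
# tracker output and annotation files would need their own parsers.
import csv
import math

def load_positions(path):
    """Load per-frame 3D hand positions from a CSV with columns
    frame,x,y,z (hypothetical format)."""
    positions = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            positions[int(row["frame"])] = (
                float(row["x"]), float(row["y"]), float(row["z"]))
    return positions

def per_frame_errors(tracked, ground_truth):
    """Euclidean distance between tracked and annotated hand positions,
    restricted to frames present in both sources."""
    errors = []
    for frame, gt in ground_truth.items():
        if frame in tracked:
            errors.append(math.dist(tracked[frame], gt))  # 3D distance
    return errors

if __name__ == "__main__":
    tracked = load_positions("tracker_hand_positions.csv")  # hypothetical file
    annotated = load_positions("manual_annotations.csv")    # hypothetical file
    errs = sorted(per_frame_errors(tracked, annotated))
    if not errs:
        raise SystemExit("no overlapping frames to compare")
    print(f"frames compared: {len(errs)}")
    print(f"mean error:   {sum(errs) / len(errs):.1f} (input units, e.g. mm)")
    print(f"median error: {errs[len(errs) // 2]:.1f}")

Summary statistics such as these, computed per sign and averaged over the dataset, are one straightforward way to report a tracker's hand-localization accuracy against manually annotated ground truth.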
