Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled

This work presents a new approach to learning a framebased classifier on weakly labelled sequence data by embedding a CNN within an iterative EM algorithm. This allows the CNN to be trained on a vast number of example images when only loose sequence level information is available for the source videos. Although we demonstrate this in the context of hand shape recognition, the approach has wider application to any video recognition task where frame level labelling is not available. The iterative EM algorithm leverages the discriminative ability of the CNN to iteratively refine the frame level annotation and subsequent training of the CNN. By embedding the classifier within an EM framework the CNN can easily be trained on 1 million hand images. We demonstrate that the final classifier generalises over both individuals and data sets. The algorithm is evaluated on over 3000 manually labelled hand shape images of 60 different classes which will be released to the community. Furthermore, we demonstrate its use in continuous sign language recognition on two publicly available large sign language data sets, where it outperforms the current state-of-the-art by a large margin. To our knowledge no previous work has explored expectation maximization without Gaussian mixture models to exploit weak sequence labels for sign language recognition.

[1]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[2]  Mubarak Shah,et al.  Discovering Motion Primitives for Unsupervised Grouping and One-Shot Learning of Human Actions, Gestures, and Expressions , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Luc Van Gool,et al.  Efficient Mining of Frequent and Distinctive Feature Configurations , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[4]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[5]  Hermann Ney,et al.  RASR - The RWTH Aachen University Open Source Speech Recognition Toolkit , 2011 .

[6]  Nicolas Pugeault,et al.  Reading the signs: A video based sign dictionary , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[7]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[8]  Jochen Triesch,et al.  Classification of hand postures against complex backgrounds using elastic graph matching , 2002, Image Vis. Comput..

[9]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[11]  David Windridge,et al.  A Linguistic Feature Vector for the Visual Interpretation of Sign Language , 2004, ECCV.

[12]  Michal Kawulok,et al.  Fast propagation-based skin regions segmentation in color images , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[13]  Hermann Ney,et al.  Enhanced continuous sign language recognition using PCA and neural network features , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[14]  Rachel McKee,et al.  The Online Dictionary of New Zealand Sign Language , 2017 .

[15]  Karl-Friedrich Kraiss,et al.  Extraction of 3D hand shape and posture from image sequences for sign language recognition , 2003, 2003 IEEE International SOI Conference. Proceedings (Cat. No.03CH37443).

[16]  Moritz Knorr,et al.  The significance of facial features for automatic sign language recognition , 2008, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[17]  Sudeep Sarkar,et al.  Automated extraction of signs from continuous sign language sentences using Iterated Conditional Modes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Hermann Ney,et al.  Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers , 2015, Comput. Vis. Image Underst..

[19]  Ying Wu,et al.  Self-Supervised Learning for Object Recognition based on Kernel Discriminant-EM Algorithm , 2001, ICCV.

[20]  Georg Heigold,et al.  GMM-Free DNN Training , 2014 .

[21]  Andrew Gilbert,et al.  Combining discriminative and model based approaches for hand pose estimation , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[22]  Hermann Ney,et al.  Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition , 2014, ECCV.

[23]  Richard Bowden,et al.  Learning signs from subtitles: A weakly supervised approach to sign language recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Hermann Ney,et al.  Tracking using dynamic programming for appearance-based sign language recognition , 2006, 7th International Conference on Automatic Face and Gesture Recognition (FGR06).

[25]  Hermann Ney,et al.  Modality Combination Techniques for Continuous Sign Language Recognition , 2013, IbPRIA.

[26]  Vincent Lepetit,et al.  Hands Deep in Deep Learning for Hand Pose Estimation , 2015, ArXiv.

[27]  Chong Wang,et al.  Simultaneous image classification and annotation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Vassilis Athitsos,et al.  Nearest neighbor search methods for handshape recognition , 2008, PETRA '08.

[29]  Charles Markham,et al.  Weakly Supervised Training of a Sign Language Recognition System Using Multiple Instance Learning Density Matrices , 2011, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[30]  Stan Sclaroff,et al.  Estimating 3D hand pose from a cluttered image , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[31]  Andrew Zisserman,et al.  Learning sign language by watching TV (using weakly aligned subtitles) , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Ron Kimmel,et al.  Rule of thumb: Deep derotation for improved fingertip detection , 2015, BMVC.

[33]  Petros Maragos,et al.  Dynamic affine-invariant shape-appearance handshape features and classification in sign language videos , 2013, J. Mach. Learn. Res..

[34]  Hermann Ney,et al.  May the force be with you: Force-aligned signwriting for automatic subunit annotation of corpora , 2013, 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[35]  Tae-Kyun Kim,et al.  Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Hermann Ney,et al.  Extensions of the Sign Language Recognition and Translation Corpus RWTH-PHOENIX-Weather , 2014, LREC.

[37]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[38]  Nicolas Pugeault,et al.  Spelling it out: Real-time ASL fingerspelling recognition , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[39]  Takayoshi Yamashita,et al.  Hand posture recognition based on bottom-up structured deep convolutional neural network with curriculum learning , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[40]  Ali Farhadi,et al.  Aligning ASL for Statistical Translation Using a Discriminative Word Model , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).