SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition

We propose a novel deep learning approach to solving simultaneous alignment and recognition problems (often referred to as "sequence-to-sequence" learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while the whole system remains trainable end-to-end. The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also enable implicit transfer learning between different interrelated tasks, which in turn lets us exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties significantly improves the performance of the overarching recognition system by better constraining the learning problem. We evaluate the proposed techniques in the challenging domain of sign language recognition, achieving state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we obtain sign recognition rates comparable to previous research, without the need for an alignment step to segment out the signs for recognition.
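
To make the decomposition concrete, the sketch below is a minimal, illustrative composition of two hypothetical SubUNets in PyTorch: a per-frame CNN producing an intermediate hand-shape representation, and a bidirectional LSTM over the resulting feature sequence trained with a CTC-style sequence loss. The layer sizes, module names, the auxiliary hand-shape head, and the choice of `nn.CTCLoss` are assumptions for illustration, not the paper's exact architecture; the point is only that gradients from the sequence-level objective flow back through every SubUNet, so the system stays trainable end-to-end while intermediate supervision can be injected where expert knowledge exists.

```python
# Minimal sketch of composing SubUNets end-to-end.
# Assumptions: layer sizes, names, and the CTC loss are illustrative choices.
import torch
import torch.nn as nn

class HandShapeSubUNet(nn.Module):
    """Per-frame CNN producing an intermediate hand-shape representation."""
    def __init__(self, num_handshapes=61, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Auxiliary head: lets hand-shape labels supervise this SubUNet directly.
        self.handshape_head = nn.Linear(feat_dim, num_handshapes)

    def forward(self, frames):                    # frames: (T, 3, H, W)
        feats = self.cnn(frames)                  # (T, feat_dim)
        return feats, self.handshape_head(feats)  # features + hand-shape logits

class SignSequenceSubUNet(nn.Module):
    """Bidirectional LSTM over frame features, emitting per-frame sign scores."""
    def __init__(self, feat_dim=256, hidden=128, num_signs=1000):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_signs + 1)  # +1 for CTC blank

    def forward(self, feats):                       # feats: (T, feat_dim)
        out, _ = self.blstm(feats.unsqueeze(1))     # (T, 1, 2*hidden)
        return self.classifier(out)                 # (T, 1, num_signs + 1)

# End-to-end composition: the sequence loss backpropagates into the CNN SubUNet.
handshape_net, sign_net = HandShapeSubUNet(), SignSequenceSubUNet()
ctc = nn.CTCLoss(blank=0)

frames = torch.randn(75, 3, 112, 112)   # one dummy video of 75 frames
target = torch.randint(1, 1001, (6,))   # 6 dummy sign-gloss labels (non-blank ids)

feats, handshape_logits = handshape_net(frames)
log_probs = sign_net(feats).log_softmax(-1)
loss = ctc(log_probs, target.unsqueeze(0),
           torch.tensor([75]), torch.tensor([6]))
loss.backward()  # a single backward pass updates both SubUNets jointly
```

In practice one would add the auxiliary hand-shape classification loss (from `handshape_logits`) to the sequence loss, which is one way the intermediate representation and the extra, differently labelled data sources described above can constrain the learning problem.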
