End-to-End Convolutional Sequence Learning for ASL Fingerspelling Recognition

Fingerspelling is an often overlooked component of sign languages, yet it has great practical value for communicating important context words that lack dedicated signs. In this paper we consider the problem of fingerspelling recognition in videos and introduce an end-to-end, lexicon-free model that consists of a deep auto-encoder image feature learner followed by an attention-based encoder-decoder for prediction. The feature extractor is a variant of the vanilla auto-encoder that employs a quadratic activation function. The learned features are then fed into the encoder-decoder, which departs from traditional recurrent neural network architectures: it is fully convolutional and is equipped with a multi-step attention mechanism that relies on a quadratic alignment function and on gated linear units applied to the convolution output. We evaluate the model on the TTIC/UChicago fingerspelling video dataset, where it outperforms previous approaches in letter accuracy under all three experimental paradigms: signer-dependent, signer-adapted, and signer-independent.
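
As a rough illustration of what "gated linear units over the convolution output" can look like, the following minimal sketch applies a 1-D convolution over a sequence of frame feature vectors and then gates the result. It assumes the common formulation in which the convolution produces twice the desired number of channels, and the first half A is multiplied elementwise by the sigmoid of the second half B. The function names, shapes, and toy values are hypothetical and chosen for illustration only. The abstract does not specify kernel widths, channel counts, or the quadratic alignment function, so none of those are modelled here.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def conv1d(seq, kernels, bias):
    """'Valid' 1-D convolution over time (no padding).

    seq:     list of T frames, each a list of d_in floats
    kernels: list of C kernels, each a list of k rows of d_in weights
    bias:    list of C floats
    returns: list of T - k + 1 vectors, each with C entries
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        window = seq[t:t + k]
        vec = []
        for kern, b in zip(kernels, bias):
            total = b
            for w_row, x_row in zip(kern, window):
                total += sum(w * x for w, x in zip(w_row, x_row))
            vec.append(total)
        out.append(vec)
    return out


def glu(vec):
    """Gated linear unit: split channels into halves A and B, return A * sigmoid(B)."""
    half = len(vec) // 2
    a, b = vec[:half], vec[half:]
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]


def gated_conv_layer(seq, kernels, bias):
    # Assumption: the convolution emits 2 * d_out channels, which the GLU halves.
    return [glu(v) for v in conv1d(seq, kernels, bias)]


# Toy usage: three 1-dimensional frames, kernel width 2, two conv channels.
frames = [[1.0], [2.0], [3.0]]
kernels = [
    [[1.0], [1.0]],  # channel A: sums the two frames in the window
    [[0.0], [0.0]],  # channel B: always 0, so the gate is sigmoid(0) = 0.5
]
bias = [0.0, 0.0]
gated = gated_conv_layer(frames, kernels, bias)
```

In this toy setup the gate is fixed at 0.5, so each output is half the window sum. In a trained layer the gate channels are learned, which lets the network decide per position how much of the convolution output to pass on.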
