American Sign Language fingerspelling recognition from video: Methods for unrestricted recognition and signer-independence

In this thesis, we study the problem of recognizing video sequences of fingerspelled letters in American Sign Language (ASL). Fingerspelling comprises a significant but relatively understudied part of ASL, and recognizing it is challenging for a number of reasons: It involves quick, small motions that are often highly coarticulated; it exhibits significant variation between signers; and there has been a dearth of continuous fingerspelling data collected. In this work, we propose several types of recognition approaches, and explore the signer variation problem. Our best-performing models are segmental (semi-Markov) conditional random fields using deep neural network-based features. In the signer-dependent setting, our recognizers achieve up to about 8% letter error rates. The signer-independent setting is much more challenging, but with neural network adaptation we achieve up to 17% letter error rates.

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Gregory Shakhnarovich,et al.  American sign language fingerspelling recognition with phonological feature-based tandem models , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[3]  Thomas Hain,et al.  Speaker dependent bottleneck layer training for speaker adaptation in automatic speech recognition , 2014, INTERSPEECH.

[4]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[5]  Trevor Darrell,et al.  Semi-supervised Domain Adaptation with Instance Constraints , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Matt Huenerfauth,et al.  Collecting a Motion-Capture Corpus of American Sign Language for Data-Driven Generation Research , 2010, SLPAT@NAACL.

[7]  David Windridge,et al.  A Linguistic Feature Vector for the Visual Interpretation of Sign Language , 2004, ECCV.

[8]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[9]  Trevor Darrell,et al.  Adapting Visual Category Models to New Domains , 2010, ECCV.

[10]  E. Fosler-Lussier,et al.  Efficient Segmental Conditional Random Fields for Phone Recognition , 2012 .

[11]  Kevin Gimpel,et al.  A comparison of training approaches for discriminative segmental models , 2014, INTERSPEECH.

[12]  Tae-Kyun Kim,et al.  Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Dimitris N. Metaxas,et al.  Handshapes and movements: Multiple-channel ASL recognition , 2004 .

[14]  Geoffrey Zweig,et al.  Classification and recognition with direct segment models , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Khe Chai Sim,et al.  Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems , 2010, INTERSPEECH.

[16]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[17]  Svetha Venkatesh,et al.  Activity recognition and abnormality detection with the switching hidden semi-Markov model , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[18]  Carlo Tomasi,et al.  Fingerspelling Recognition through Classification of Letter-to-Letter Transitions , 2009, ACCV.

[19]  Carol Padden,et al.  Learning American Sign Language: Levels I & II--Beginning & Intermediate, with DVD (Text & DVD Package) (2nd Edition) , 2003 .

[20]  Ali Farhadi,et al.  Transfer Learning in Sign language , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[22]  Eun-Jung Holden,et al.  Dynamic Fingerspelling Recognition using Geometric and Motion Features , 2006, 2006 International Conference on Image Processing.

[23]  Kirsti Grobel,et al.  Isolated sign language recognition using hidden Markov models , 1996, 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation.

[24]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[25]  Carol Padden,et al.  How the Alphabet Came to Be Used in a Sign Language , 2003 .

[26]  Petros Maragos,et al.  Affine-invariant modeling of shape-appearance images applied on sign language handshape classification , 2010, 2010 IEEE International Conference on Image Processing.

[27]  Hermann Ney,et al.  Improving Continuous Sign Language Recognition: Speech Recognition Techniques and System Design , 2013, SLPAT.

[28]  Hermann Ney,et al.  Speech recognition techniques for a sign language recognition system , 2007, INTERSPEECH.

[29]  ChengXiang Zhai,et al.  Instance Weighting for Domain Adaptation in NLP , 2007, ACL.

[30]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[31]  Dimitris N. Metaxas,et al.  Parallel hidden Markov models for American sign language recognition , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[32]  Dimitris N. Metaxas,et al.  Toward Scalability in ASL Recognition: Breaking Down Signs into Phonemes , 1999, Gesture Workshop.

[33]  Ramesh Raskar,et al.  Exploiting Depth Discontinuities for Vision-Based Fingerspelling Recognition , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[34]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[35]  Kaisheng Yao,et al.  Adaptation of context-dependent deep neural networks for automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[36]  Geoffrey Zweig,et al.  Speech recognitionwith segmental conditional random fields: A summary of the JHU CLSP 2010 Summer Workshop , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37]  Geoffrey E. Hinton,et al.  On rectified linear units for speech processing , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[38]  Diane Brentari,et al.  A Prosodic Model of Sign Language Phonology , 1999 .

[39]  Sudeep Sarkar,et al.  Automated extraction of signs from continuous sign language sentences using Iterated Conditional Modes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Ming C. Leu,et al.  Recognition of Finger Spelling of American Sign Language with Artificial Neural Network Using Position/Orientation Sensors and Data Glove , 2005, ISNN.

[41]  Andrew Zisserman,et al.  Upper Body Detection and Tracking in Extended Signing Sequences , 2011, International Journal of Computer Vision.

[42]  Bin Yu,et al.  Feature learning based on SAE-PCA network for human gesture recognition in RGBD images , 2015, Neurocomputing.

[43]  Frank Wannemaker Foreign Vocabulary In Sign Languages A Cross Linguistic Investigation Of Word Formation , 2016 .

[44]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[45]  Hui Jiang,et al.  Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[46]  Li Deng,et al.  HMM adaptation using vector taylor series for noisy speech recognition , 2000, INTERSPEECH.

[47]  Martha E. Tyrone,et al.  Interarticulator co-ordination in Deaf signers with Parkinson’s disease , 1999, Neuropsychologia.

[48]  Rama Chellappa,et al.  Domain adaptation for object recognition: An unsupervised approach , 2011, 2011 International Conference on Computer Vision.

[49]  Seong-Whan Lee,et al.  Sign language spotting based on semi-Markov Conditional Random Field , 2009, 2009 Workshop on Applications of Computer Vision (WACV).

[50]  Kirsti Grobel,et al.  Video-based Recognition of Fingerspelling in Real-Time , 1996, Bildverarbeitung für die Medizin.

[51]  Stan Sclaroff,et al.  Exploiting phonological constraints for handshape inference in ASL video , 2011, CVPR 2011.

[52]  Gregory Shakhnarovich,et al.  Fingerspelling Recognition with Semi-Markov Conditional Random Fields , 2013, 2013 IEEE International Conference on Computer Vision.

[53]  Jean-Luc Gauvain,et al.  Speaker adaptation based on MAP estimation of HMM parameters , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[54]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[55]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[56]  George Kollios,et al.  BoostMap: A method for efficient approximate similarity rankings , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[57]  Petros Maragos,et al.  Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition , 2011, CVPR 2011 WORKSHOPS.

[58]  Karen Livescu,et al.  Signer-independent fingerspelling recognition with deep neural network adaptation , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[59]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[60]  Steve Renals,et al.  Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[61]  Jovan Popovic,et al.  Real-time hand-tracking with a color glove , 2009, SIGGRAPH '09.

[62]  R. Mitchell,et al.  How Many People Use ASL in the United States? Why Estimates Need Updating , 2006 .

[63]  Daniel P. W. Ellis,et al.  Tandem acoustic modeling in large-vocabulary recognition , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[64]  Nicolas Pugeault,et al.  Spelling it out: Real-time ASL fingerspelling recognition , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[65]  Lale Akarun,et al.  Hand Pose Estimation and Hand Shape Classification Using Multi-layered Randomized Decision Forests , 2012, ECCV.

[66]  Hermann Ney,et al.  Geometric Features for Improving Continuous Appearance-based Sign Language Recognition , 2006, BMVC.

[67]  Simon King,et al.  An Articulatory Feature-Based Tandem Approach and Factored Observation Modeling , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[68]  Yingli Tian,et al.  Histogram of 3D Facets: A depth descriptor for human action and hand gesture recognition , 2015, Comput. Vis. Image Underst..

[69]  Sherman Wilcox,et al.  The phonetics of fingerspelling , 1992 .

[70]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[71]  Koby Crammer,et al.  A theory of learning from different domains , 2010, Machine Learning.

[72]  Ruiduo Yang,et al.  Detecting Coarticulation in Sign Language using Conditional Random Fields , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[73]  Antonis A. Argyros,et al.  Efficient model-based 3D tracking of hand articulations using Kinect , 2011, BMVC.

[74]  Kevin Gimpel,et al.  Discriminative segmental cascades for feature-rich phone recognition , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[75]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[76]  Ciro Martins,et al.  Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system , 1995, EUROSPEECH.

[77]  Geoffrey Zweig,et al.  A segmental CRF approach to large vocabulary continuous speech recognition , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[78]  John Blitzer,et al.  Domain Adaptation with Structural Correspondence Learning , 2006, EMNLP.

[79]  Simon King,et al.  Articulatory Feature-Based Methods for Acoustic and Audio-Visual Speech Recognition: Summary from the 2006 JHU Summer workshop , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[80]  Petros Maragos,et al.  Model-level data-driven sub-units for signs in videos of continuous Sign Language , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[81]  Jonathan Keane,et al.  Towards an articulatory model of handshape:What fingerspelling tells us about the phonetics and phonology of handshape in American Sign Language , 2014 .

[82]  Andrew Zisserman,et al.  Automatic and Efficient Long Term Arm and Hand Tracking for Continuous Sign Language TV Broadcasts , 2012, BMVC.

[83]  Stephan Liwicki,et al.  Automatic recognition of fingerspelled words in British Sign Language , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[84]  Stan Sclaroff,et al.  Large Lexicon Project : American Sign Language Video Corpus and Sign Language Indexing / Retrieval Algorithms , 2010 .

[85]  Tanja Schultz,et al.  Integrating multilingual articulatory features into speech recognition , 2003, INTERSPEECH.

[86]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[87]  Samir I. Shaheen,et al.  Sign language recognition using a combination of new vision based features , 2011, Pattern Recognit. Lett..

[88]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[89]  Jonathan G. Fiscus,et al.  Tools for the analysis of benchmark speech recognition tests , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[90]  Liya Ding,et al.  Modelling and recognition of the linguistic components in American Sign Language , 2009, Image Vis. Comput..

[91]  Dimitris N. Metaxas,et al.  A Framework for Recognizing the Simultaneous Aspects of American Sign Language , 2001, Comput. Vis. Image Underst..

[92]  Andreas Stolcke,et al.  SRILM at Sixteen: Update and Outlook , 2011 .

[93]  Hermann Ney,et al.  Signspeak--understanding, recognition, and translation of sign languages , 2010 .

[94]  Hank Liao,et al.  Speaker adaptation of context dependent deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.