Paper information - American Sign Language fingerspelling recognition from video: Methods for unrestricted recognition and signer-independence

In this thesis, we study the problem of recognizing video sequences of fingerspelled letters in American Sign Language (ASL). Fingerspelling comprises a significant but relatively understudied part of ASL, and recognizing it is challenging for a number of reasons: it involves quick, small motions that are often highly coarticulated; it exhibits significant variation between signers; and little continuous fingerspelling data has been collected. In this work, we propose several types of recognition approaches and explore the signer variation problem. Our best-performing models are segmental (semi-Markov) conditional random fields using deep neural network-based features. In the signer-dependent setting, our recognizers achieve letter error rates as low as about 8%. The signer-independent setting is much more challenging, but with neural network adaptation we achieve letter error rates as low as 17%.
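The abstract reports results as letter error rates without spelling out how they are scored. The sketch below assumes the conventional definition: the total Levenshtein (edit) distance between recognized and reference letter sequences, divided by the total number of reference letters. The function names `edit_distance` and `letter_error_rate` are hypothetical and chosen for illustration; this is not the thesis's own scoring tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, insertions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def letter_error_rate(references, hypotheses):
    """Assumed definition: summed edit distance / total reference letters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no letters")
    return errors / total


if __name__ == "__main__":
    # One deleted letter out of five reference letters.
    print(letter_error_rate(["hello"], ["helo"]))
```

Under this definition the error rate can exceed 100% when a hypothesis contains many insertions, which is why it is usually reported alongside, rather than as a simple complement of, per-letter accuracy.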

Taehwan Kim
