American Sign Language Recognition: Reducing the Complexity of the Task with Phoneme-Based Modeling and Parallel Hidden Markov Models

In this thesis I present a framework for recognizing American Sign Language (ASL) from 3D data. The goal is to develop approaches that scale well with increasing vocabulary sizes. Scalability is a major concern, because the computational treatment of ASL is a very complex undertaking. Two points particularly stand out. First, ASL is a highly inflected language, so signs appear in too many inflectional variants to model each one separately. Second, in ASL events occur both sequentially and simultaneously. Unlike speech recognition, ASL recognition cannot consider all possible combinations of simultaneous events explicitly, because of their sheer number. As a result, the computational treatment of ASL is much more complex than that of spoken languages. Reducing the complexity of the task requires a two-pronged approach, which encompasses work on both the modeling and the computational sides.
On the modeling side, I tackle the many appearances by breaking the signs down into their constituent phonemes, which are limited in number. I use the Movement-Hold phonological model for ASL as a guideline, and extend the parts of it that are not directly applicable to recognition systems. In addition, I recast it to describe simultaneous events in independent channels, so that it is no longer necessary to consider all their possible combinations. The result is a significant reduction of the modeling complexity.
On the recognition side, I propose parallel hidden Markov models (PaHMMs) as an extension to conventional hidden Markov models. I develop a PaHMM recognition algorithm specifically geared toward the properties of sign languages. PaHMMs are the computational counterpart to modeling simultaneous events in independent channels, and allow the channels to be put together on the fly at recognition time, instead of having to consider their combinations a priori.
I validate the modeling approach and the PaHMM recognition algorithm in a pilot study with experiments on 53-sign and 22-sign data sets. In the PaHMM experiments, the independent channels consist of the movements of both hands and the handshape of the strong hand. The results demonstrate the viability of both the phoneme modeling and the description of simultaneous events in independent channels.
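The idea of modeling simultaneous events in independent channels and combining them only at recognition time can be illustrated with a small sketch. It is not the PaHMM recognition algorithm developed in the thesis. It is a simplified isolated-sign version that assumes discrete observations, one HMM per channel, and full independence between channels, so that per-channel log-likelihoods simply add. The names `toy_hmm`, `pahmm_score`, `recognize`, the two toy signs, and all probabilities are hypothetical and chosen only for illustration.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(hmm, observations):
    """log P(observations | hmm) for a discrete HMM, via the forward algorithm."""
    start, trans, emit = hmm["start"], hmm["trans"], hmm["emit"]
    n = len(start)
    alpha = [safe_log(start[s]) + safe_log(emit[s][observations[0]])
             for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            log_sum_exp([alpha[p] + safe_log(trans[p][s]) for p in range(n)])
            + safe_log(emit[s][obs])
            for s in range(n)
        ]
    return log_sum_exp(alpha)

def pahmm_score(channel_models, channel_observations):
    """Assumed independence: sum the per-channel log-likelihoods."""
    return sum(
        forward_log_likelihood(channel_models[ch], channel_observations[ch])
        for ch in channel_models
    )

def recognize(sign_models, channel_observations):
    """Return the sign whose per-channel models jointly score highest."""
    return max(sign_models,
               key=lambda sign: pahmm_score(sign_models[sign], channel_observations))

def toy_hmm(p):
    """Hypothetical 2-state left-to-right HMM emitting symbol 0 with probability p."""
    return {
        "start": [1.0, 0.0],
        "trans": [[0.5, 0.5], [0.0, 1.0]],
        "emit": [[p, 1.0 - p], [p, 1.0 - p]],
    }

# Two made-up signs, each with a movement channel and a handshape channel.
sign_models = {
    "A": {"movement": toy_hmm(0.9), "handshape": toy_hmm(0.9)},
    "B": {"movement": toy_hmm(0.1), "handshape": toy_hmm(0.1)},
}
# Channels may have different lengths; each is scored by its own model.
observed = {"movement": [0, 0, 0], "handshape": [0, 0]}
best_sign = recognize(sign_models, observed)
```

Because each channel is scored by its own model, the number of models grows with the sum of the per-channel inventories rather than with their product, which is the complexity reduction the abstract describes.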
