Multimodal gesture recognition

Starting from the famous "Put That There!" demonstration prototype, developed by the Architecture Machine Group at MIT in the late 1970s, the potential of multimodal gesture interfaces for natural human-machine communication has stimulated the imagination and motivated significant research efforts in the fields of computer vision, speech recognition, multimodal sensing, fusion, and human-computer interaction (HCI). In the words of Bolt [1980, p. 1]: "Because voice can be augmented with simultaneous pointing, the free usage of pronouns becomes possible, with a corresponding gain in naturalness and economy of expression. Conversely, gesture aided by voice gains precision in its power to reference."

Multimodal gesture recognition lies at the heart of such interfaces. As also defined in the Glossary, the term refers to the complex computational task comprising three main modules: (a) tracking of human movements, primarily of the hands and arms, and recognition of such characteristic motion patterns; (b) detection of accompanying speech activity and recognition of what is spoken; and (c) combination of the available audio-visual information streams to identify the multimodally communicated message.

To perform these tasks successfully, the original "Put That There!" system of Bolt [1980] imposed certain limitations on the interaction. Specifically, it required the user to be tethered, wearing a position-sensing device on the wrist to capture gesturing and a headset microphone to record speech, and it allowed multimodal manipulation, via speech and gestures, of only a small set of shapes on a rather large screen (see also Figure 11.1). Since then, however, research in multimodal gesture recognition has moved beyond such limited scenarios, capturing and processing the multimodal data streams with distant audio and visual sensors that are unobtrusive to humans. In particular, in recent years, the introduction of affordable and compact multimodal sensors like the Microsoft Kinect has enabled robust capturing of human activity, thanks to the wealth of raw and metadata streams the device provides in addition to traditional planar RGB video, such as depth scene information, multiple audio channels, and human skeleton and facial tracking, among others [Kinect 2016]. Such advances have led to intensified efforts to integrate multimodal gesture interfaces into real-life applications. Indeed, the field has been attracting increasing interest, driven by novel HCI paradigms on a continuously expanding range of devices equipped with multimodal sensors and ever-increasing computational power, for example smartphones and smart television sets.

Nevertheless, the capabilities of modern multimodal gesture systems remain limited. In particular, the set of gestures accounted for in typical setups is mostly constrained to pointing gestures, a number of emblematic ones such as an open palm, and gestures corresponding to some sort of interaction with a physical object, e.g., pinching for zooming. At the same time, fusion with speech remains in most cases an experimental feature. Compared to the abundance and variety of gestures and their interplay with speech in natural human communication, there is clearly still a long way to go for the corresponding HCI research and development [Kopp 2013]. Multimodal gesture recognition constitutes a wide, multidisciplinary field.
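To make the three-module decomposition given above more concrete, the following minimal Python sketch illustrates the third module only: combining independently produced gesture and speech hypotheses by a simple weighted late fusion over temporally overlapping segments. It is purely illustrative and is not the system described in Section 11.4 nor any published implementation; the Hypothesis class, the fuse_late function, the weights, and the toy hypothesis lists are all assumptions introduced here for the sake of the example.

```python
# Hypothetical late-fusion sketch: pair each gesture hypothesis with each
# temporally overlapping speech hypothesis and keep the best combined score.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    label: str       # e.g. the gesture "point-left" or the spoken word "there"
    score: float     # modality-specific confidence (higher is better)
    t_start: float   # segment start, in seconds
    t_end: float     # segment end, in seconds


def fuse_late(gestures: List[Hypothesis],
              speech: List[Hypothesis],
              w_gesture: float = 0.5,
              w_speech: float = 0.5) -> Optional[str]:
    """Return the speech+gesture pair with the best weighted combined score."""
    best_pair, best_score = None, float("-inf")
    for g in gestures:
        for s in speech:
            overlap = min(g.t_end, s.t_end) - max(g.t_start, s.t_start)
            if overlap <= 0:          # ignore non-overlapping segments
                continue
            combined = w_gesture * g.score + w_speech * s.score
            if combined > best_score:
                best_pair, best_score = (s.label, g.label), combined
    return None if best_pair is None else f"{best_pair[0]} + {best_pair[1]}"


if __name__ == "__main__":
    # Toy hypothesis lists standing in for the outputs of modules (a) and (b).
    gestures = [Hypothesis("point-left", 0.8, 0.2, 1.1),
                Hypothesis("open-palm", 0.4, 1.5, 2.0)]
    speech = [Hypothesis("there", 0.9, 0.3, 0.9),
              Hypothesis("stop", 0.3, 1.6, 2.1)]
    print(fuse_late(gestures, speech))   # -> "there + point-left"
```

Realistic systems replace this naive pairwise scoring with the statistical and multiple-hypothesis fusion schemes reviewed in Section 11.3 and demonstrated in Section 11.4.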
This chapter provides a comprehensive overview of the field, in both theoretical and application terms. More specifically, basic concepts related to gesturing, the multifaceted interplay of gestures and speech, and the importance of gestures in HCI are discussed in Section 11.2. An overview of current trends in multimodal gesture recognition is provided in Section 11.3, focusing separately on gestures, speech, and multimodal fusion. Furthermore, a state-of-the-art recognition setup developed by the authors is described in detail in Section 11.4, to give a better sense of the practical considerations involved in building such a system. In closing, the future of multimodal gesture recognition and related challenges are discussed in Section 11.5. Finally, a set of Focus Questions to aid comprehension of the material is also provided.

[1]  Sharon L. Oviatt,et al.  The Paradigm Shift to Multimodality in Contemporary Computer Interfaces , 2015, Synthesis Lectures on Human-Centered Informatics.

[2]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[3]  Jonathan Foote,et al.  An overview of audio information retrieval , 1999, Multimedia Systems.

[4]  Dong Yu,et al.  Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP] , 2011, IEEE Signal Processing Magazine.

[5]  Matthew Turk,et al.  Multimodal interaction: A review , 2014, Pattern Recognit. Lett..

[6]  Gesture and speech in interaction: An overview (Guest Editorial), 2013.

[7]  Petros Maragos,et al.  Multimodal gesture recognition via multiple hypotheses rescoring , 2015, J. Mach. Learn. Res..

[8]  Benjamin Schrauwen,et al.  Sign Language Recognition Using Convolutional Neural Networks , 2014, ECCV Workshops.

[9]  Sergio Escalera,et al.  ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary , 2013, ICMI '13.

[10]  Andrew Zisserman,et al.  Learning sign language by watching TV (using weakly aligned subtitles) , 2009, CVPR.

[11]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[12]  Giulio Paci,et al.  A Multi-scale Approach to Gesture Detection and Recognition , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[13]  K. Pine,et al.  The effects of prohibiting gestures on children's lexical retrieval ability. , 2007, Developmental science.

[14]  Ankit Chaudhary,et al.  Intelligent Approaches to interact with Machines using Hand Gesture Recognition in Natural way: A Survey , 2011, ArXiv.

[15]  Jian Cheng,et al.  Bayesian Co-Boosting for Multi-modal Gesture Recognition , 2014, Gesture Recognition.

[16]  Dimitris N. Metaxas,et al.  A Framework for Recognizing the Simultaneous Aspects of American Sign Language , 2001, Comput. Vis. Image Underst..

[17]  Jean-Marc Colletta,et al.  Age-related changes in co-speech gesture and narrative: Evidence from French children and adults , 2010, Speech Commun..

[18]  Robert M. Krauss,et al.  Gesture and Speech in Spontaneous and Rehearsed Narratives , 1994 .

[19]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Chung-Lin Huang,et al.  Hand gesture recognition using a real-time tracking method and hidden Markov models , 2003, Image Vis. Comput..

[21]  Petros Maragos,et al.  On Shape Recognition and Language , 2016, Perspectives in Shape Analysis.

[22]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[23]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[24]  George Saon,et al.  The IBM 2016 English Conversational Telephone Speech Recognition System , 2016, INTERSPEECH.

[25]  Richard Rose,et al.  A hidden Markov model based keyword recognition system , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[26]  Sharon L. Oviatt,et al.  Multimodal Integration - A Statistical View , 1999, IEEE Trans. Multim..

[27]  Susan Duncan,et al.  Growth points from the very beginning , 2008 .

[28]  Limin Wang,et al.  Action and Gesture Temporal Spotting with Super Vector Representation , 2014, ECCV Workshops.

[29]  Ling Shao,et al.  Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Philippe A. Palanque,et al.  Fusion engines for multimodal input: a survey , 2009, ICMI-MLMI '09.

[31]  A. Kendon Gesture: Visible Action as Utterance , 2004 .

[32]  B. Rimé The elimination of visible behaviour from social interactions: Effects on verbal, nonverbal and interpersonal variables , 1982 .

[33]  Michael Johnston,et al.  Finite-state Multimodal Parsing and Understanding , 2000, COLING.

[34]  Trevor Darrell,et al.  Latent-Dynamic Discriminative Models for Continuous Gesture Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Sergio Escalera,et al.  Challenges in multimodal gesture recognition , 2016, J. Mach. Learn. Res..

[36]  Hanqing Lu,et al.  Fusing multi-modal features for gesture recognition , 2013, ICMI '13.

[37]  Martin Saerbeck,et al.  Recent methods and databases in vision-based hand gesture recognition: A review , 2015, Comput. Vis. Image Underst..

[38]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[39]  Mohammed Yeasin,et al.  A real-time framework for natural multimodal interaction with large screen displays , 2002, Proceedings. Fourth IEEE International Conference on Multimodal Interfaces.

[40]  Trevor Darrell,et al.  Hidden Conditional Random Fields for Gesture Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[41]  Manolya Kavakli,et al.  A survey of speech-hand gesture recognition for the development of multimodal interfaces in computer games , 2010, 2010 IEEE International Conference on Multimedia and Expo.

[42]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  D. McNeill Gesture and Thought , 2005 .

[44]  Susan M. Wagner,et al.  Explaining Math: Gesturing Lightens the Load , 2001, Psychological science.

[45]  C. Creider Hand and Mind: What Gestures Reveal about Thought , 1994 .

[46]  Yihsiu Chen,et al.  Language and Gesture: Lexical gestures and lexical access: a process model , 2000 .

[47]  Susan Goldin-Meadow,et al.  Children Learn When Their Teacher's Gestures and Speech Differ , 2005, Psychological science.

[48]  Petros Maragos,et al.  Multimodal human action recognition in assistive human-robot interaction , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49]  Steve Young,et al.  The HTK book , 1995 .

[50]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[51]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[52]  P. Ekman,et al.  The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding , 1969 .

[53]  G. Schafer,et al.  Maternal label and gesture use affects acquisition of specific object names. , 2011, Journal of child language.

[54]  Vaibhava Goel,et al.  Audio and visual modality combination in speech processing applications , 2017, The Handbook of Multimodal-Multisensor Interfaces, Volume 1.

[55]  Anupam Agrawal,et al.  Vision based hand gesture recognition for human computer interaction: a survey , 2012, Artificial Intelligence Review.

[56]  Bin Yu,et al.  Boosting with early stopping: Convergence and consistency , 2005, math/0508276.

[57]  Sergio Escalera,et al.  Multi-modal gesture recognition challenge 2013: dataset and results , 2013, ICMI '13.

[58]  M. Alibali,et al.  Gesture's role in speaking, learning, and creating language. , 2013, Annual review of psychology.

[59]  Sharon L. Oviatt,et al.  Combining User Modeling and Machine Learning to Predict Users' Multimodal Integration Patterns , 2006, MLMI.

[60]  Autumn B. Hostetter,et al.  When do gestures communicate? A meta-analysis. , 2011, Psychological bulletin.

[61]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[62]  J. P. Foley,et al.  Gesture and Environment , 1942 .

[63]  L A Thompson,et al.  Evaluation and integration of speech and pointing gestures during referential understanding. , 1986, Journal of experimental child psychology.

[64]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[65]  Richard Rose,et al.  Discriminant wordspotting techniques for rejecting non-vocabulary utterances in unconstrained speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[66]  Stefanos Zafeiriou,et al.  Deep learning for multisensorial and multimodal interaction , 2018, The Handbook of Multimodal-Multisensor Interfaces, Volume 2.

[67]  Stefan Kopp,et al.  Using cognitive models to understand multimodal processes: the case for speech and gesture production , 2017, The Handbook of Multimodal-Multisensor Interfaces, Volume 1.

[68]  R. Schwartz,et al.  A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[69]  Ezequiel Morsella,et al.  The role of gestures in spatial working memory and speech. , 2004, The American journal of psychology.

[70]  R. Krauss,et al.  Word Familiarity Predicts Temporal Asynchrony of Hand Gestures and Speech , 2010 .

[71]  Hervé Bourlard,et al.  Continuous speech recognition using multilayer perceptrons with hidden Markov models , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[72]  Sergio Escalera,et al.  ChaLearn Looking at People Challenge 2014: Dataset and Results , 2014, ECCV Workshops.

[73]  Joëlle Coutaz,et al.  A design space for multimodal systems: concurrent processing and data fusion , 1993, INTERCHI.

[74]  U. Hadar,et al.  Gesture and the Processing of Speech: Neuropsychological Evidence , 1998, Brain and Language.

[75]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[76]  Chin-Hui Lee,et al.  Automatic recognition of keywords in unconstrained speech using hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..

[77]  D. McNeill So you think gestures are nonverbal , 1985 .

[78]  Michael Johnston,et al.  Unification-based Multimodal Parsing , 1998, ACL.

[79]  Geoffrey E. Hinton,et al.  Deep Belief Networks for phone recognition , 2009 .

[80]  Petros Maragos,et al.  Kinect-based multimodal gesture recognition using a two-pass fusion scheme , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[81]  Vladimir Pavlovic,et al.  Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[82]  Kazuya Takeda,et al.  Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech , 2014, EURASIP Journal on Audio, Speech, and Music Processing.

[83]  Philip R. Cohen,et al.  Multimodal speech and pen interfaces , 2017, The Handbook of Multimodal-Multisensor Interfaces, Volume 1.

[84]  Geoffrey Zweig,et al.  The microsoft 2016 conversational speech recognition system , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[85]  R. Krauss Why Do We Gesture When We Speak? , 1998 .

[86]  Luc Van Gool,et al.  Does Human Action Recognition Benefit from Pose Estimation? , 2011, BMVC.

[87]  Christian Wolf,et al.  Multi-scale Deep Learning for Gesture Detection and Localization , 2014, ECCV Workshops.

[88]  De Ruiter,et al.  Can gesticulation help aphasic people speak, or rather, communicate? , 2006 .

[89]  Patricia Zukow-Goldring,et al.  Sensitive Caregiving Fosters the Comprehension of Speech: When Gestures Speak Louder than Words , 1996 .

[90]  J. Cassell Computer Vision for Human–Machine Interaction: A Framework for Gesture Generation and Interpretation , 1998 .

[91]  Jakob Nielsen A Virtual Protocol Model for Computer-Human Interaction , 1984 .

[92]  Richard A. Bolt,et al.  “Put-that-there”: Voice and gesture at the graphics interface , 1980, SIGGRAPH '80.

[93]  S. Goldin-Meadow,et al.  Why people gesture when they speak , 1998, Nature.

[94]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[95]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[96]  Jana Bressem,et al.  Rethinking gesture phases: Articulatory features of gestural movement? , 2011 .

[97]  Wei-Yun Yau,et al.  A multi-modal gesture recognition system using audio, video, and skeletal joint data , 2013, ICMI '13.

[98]  Jacinta Douglas,et al.  The differential facilitatory effects of gesture and visualisation processes on object naming in aphasia , 2001 .

[99]  A. Burstein,et al.  Ideational gestures and speech in brain-damaged subjects , 1998 .

[100]  D. McNeill,et al.  Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information , 1998 .

[101]  Brian Kingsbury,et al.  New types of deep neural network learning for speech recognition and related applications: an overview , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[102]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[103]  David McNeill,et al.  Body – Language – Communication: An International Handbook on Multimodality in Human Interaction , 2013 .

[104]  S. Goldin-Meadow,et al.  Assessing Knowledge Through Gesture: Using Children's Hands to Read Their Minds , 1992 .

[105]  Stefan Kopp Giving interaction a hand: deep models of co-speech gesture in multimodal systems , 2013, ICMI '13.

[106]  Dong Yu,et al.  Deep Learning and Its Applications to Signal and Information Processing , 2011 .

[107]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[108]  S. Nobe Language and Gesture: Where do most spontaneous representational gestures actually occur with respect to speech? , 2000 .

[109]  S. Mitra,et al.  Gesture Recognition: A Survey , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[110]  Petros Maragos,et al.  Cross-Modal Integration for Performance Improving in Multimedia: A Review , 2008, Multimodal Processing and Interaction.

[111]  Camille Monnier,et al.  A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition , 2014, ECCV Workshops.

[112]  R. Krauss, et al.  Gesture, Speech, and Lexical Access: The Role of Lexical Movements in Speech Production, 1996, Psychological Science.

[113]  Aaron F. Bobick,et al.  Parametric Hidden Markov Models for Gesture Recognition , 1999, IEEE Trans. Pattern Anal. Mach. Intell..