Gesture in automatic discourse processing

Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning. My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that generalized poorly across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract. These ideas find successful application in three language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features, extracted automatically from video, yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
