Speech-gesture driven multimodal interfaces for crisis management

Emergency response requires time-critical strategic assessment of risks, decisions, and communications, and it requires teams of individuals to have fast access to large volumes of complex information and to technologies that enable tightly coordinated work. Crisis management teams in emergency operations centers can access this information through various human-computer interfaces. Unfortunately, these interfaces are hard to use, require extensive training, and often impede rather than support teamwork. Dialogue-enabled devices, based on natural, multimodal interfaces, have the potential to make a variety of information technology tools accessible during crisis management. This paper establishes the importance of multimodal interfaces in various aspects of crisis management and explores many issues in realizing successful speech-gesture driven, dialogue-enabled interfaces for crisis management. The paper is organized in five parts. The first part discusses the needs of crisis management that can potentially be met by the development of appropriate interfaces. The second part discusses issues related to the design and development of multimodal interfaces in the context of crisis management. The third part discusses the state of the art in both the theories and practices involving these human-computer interfaces. In particular, it describes the evolution and implementation details of two representative systems, Crisis Management (XISM) and Dialog Assisted Visual Environment for Geoinformation (DAVE_G). The fourth part speculates on the short-term and long-term research directions that will help address the outstanding challenges in interfaces that support dialogue and collaboration. Finally, the fifth part concludes the paper.
