Multimodal Human Computer Interaction: A Survey
[1] Thomas B. Moeslund,et al. A Survey of Computer Vision-Based Human Motion Capture , 2001, Comput. Vis. Image Underst..
[2] Nicu Sebe,et al. Affective Meeting Video Analysis , 2005, 2005 IEEE International Conference on Multimedia and Expo.
[3] Azriel Rosenfeld,et al. Face recognition: A literature survey , 2003, CSUR.
[4] Jake K. Aggarwal,et al. Human Motion Analysis: A Review , 1999, Comput. Vis. Image Underst..
[5] Michael J. Black,et al. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion , 1995, Proceedings of IEEE International Conference on Computer Vision.
[6] Jeffrey S. Shell,et al. Augmenting and sharing memory with eyeBlog , 2004, CARPE'04.
[7] L. Rothkrantz,et al. Toward an affect-sensitive multimodal human-computer interaction , 2003, Proc. IEEE.
[8] Arun Ross,et al. Information fusion in biometrics , 2003, Pattern Recognit. Lett..
[9] Douglas B. Moran,et al. The Open Agent Architecture: A Framework for Building Distributed Software Systems , 1999, Appl. Artif. Intell..
[10] Niels Ole Bernsen. Defining a taxonomy of output modalities from an HCI perspective , 1997, Comput. Stand. Interfaces.
[11] Michael J. Lyons,et al. Designing, Playing, and Performing with a Vision-based Mouth Interface , 2003, NIME.
[12] Jun Ohya,et al. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences , 1997, Proceedings of International Conference on Image Processing.
[13] Marvin Minsky,et al. A framework for representing knowledge , 1974 .
[14] Steven M. Seitz,et al. Techniques for interactive audience participation , 2002, SIGGRAPH '02.
[15] Matthew Turk,et al. Gesture Recognition in Handbook of Virtual Environment Technology , 2001 .
[16] Alex Pentland,et al. LAFTER: a real-time face and lips tracker with facial expression recognition , 2000, Pattern Recognit..
[17] N. Ambady,et al. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. , 1992 .
[18] Ivan Marsic,et al. A framework for rapid development of multimodal interfaces , 2003, ICMI '03.
[19] Rómer Rosales,et al. Learning Body Pose via Specialized Maps , 2001, NIPS.
[20] Nicu Sebe,et al. Affective multimodal human-computer interaction , 2005, ACM Multimedia.
[21] Gregory D. Abowd,et al. Perceptual user interfaces using vision-based eye tracking , 2003, ICMI '03.
[22] Mathias Kölsch,et al. Emerging Topics in Computer Vision , 2004 .
[23] Sameer Singh,et al. Video analysis of human dynamics - a survey , 2003, Real Time Imaging.
[24] Erik Hjelmås,et al. Face Detection: A Survey , 2001, Comput. Vis. Image Underst..
[25] D. McNeill. Hand and Mind: What Gestures Reveal about Thought , 1992 .
[26] Qiang Ji,et al. Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance , 2002, Real Time Imaging.
[27] James W. Davis,et al. Gesture Recognition , 2023, International Research Journal of Modernization in Engineering Technology and Science.
[28] Rajeev Sharma,et al. Understanding Gestures in Multimodal Human Computer Interaction , 2000, Int. J. Artif. Intell. Tools.
[29] Datong Chen,et al. Multimodal detection of human interaction events in a nursing home environment , 2004, ICMI '04.
[30] Mary P. Harper,et al. VACE Multimodal Meeting Corpus , 2005, MLMI.
[31] Illah R. Nourbakhsh,et al. A survey of socially interactive robots , 2003, Robotics Auton. Syst..
[32] Hiroshi Ishii,et al. Bricks: laying the foundations for graspable user interfaces , 1995, CHI '95.
[33] Juergen Luettin,et al. Audio-Visual Automatic Speech Recognition: An Overview , 2004 .
[34] Richard A. Bolt,et al. “Put-that-there”: Voice and gesture at the graphics interface , 1980, SIGGRAPH '80.
[35] Alex Pentland,et al. Looking at People: Sensing for Ubiquitous and Wearable Computing , 2000, IEEE Trans. Pattern Anal. Mach. Intell..
[36] Maja Pantic,et al. Automatic Analysis of Facial Expressions: The State of the Art , 2000, IEEE Trans. Pattern Anal. Mach. Intell..
[37] Dae-Jong Lee,et al. Emotion recognition from the facial image and speech signal , 2003, SICE 2003 Annual Conference (IEEE Cat. No.03TH8734).
[38] M. Mehta,et al. Multimodal Input Fusion in Human-Computer Interaction: On the Example of the NICE Project , 2003 .
[39] Kirk P. Arnett,et al. Productivity gains via an adaptive user interface: an empirical analysis , 1994, Int. J. Hum. Comput. Stud..
[40] Margrit Betke,et al. Communication via eye blinks and eyebrow raises: video-based human-computer interfaces , 2003, Universal Access in the Information Society.
[41] Emile H. L. Aarts,et al. Ambient intelligence: a multimedia perspective , 2004, IEEE MultiMedia.
[42] Jian-Gang Wang,et al. Eye gaze estimation from a single image of one eye , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[43] W. Buxton. Human-Computer Interaction , 1988, Springer Berlin Heidelberg.
[44] Ben Shneiderman,et al. Direct manipulation for comprehensible, predictable and controllable user interfaces , 1997, IUI '97.
[45] Joseph A. Paradiso,et al. Optical Tracking for Music and Dance Performance , 1997 .
[46] Mark T. Maybury,et al. Intelligent multimedia interfaces , 1994, CHI Conference Companion.
[47] Sébastien Marcel,et al. Gestures for Multi-Modal Interfaces: A Review , 2002 .
[48] Rosalind W. Picard. Affective computing: challenges , 2003, Int. J. Hum. Comput. Stud..
[49] Paul F. M. J. Verschure,et al. Live Soundscape Composition Based on Synthetic Emotions , 2003, IEEE Multim..
[50] Richard Simpson,et al. The smart wheelchair component system. , 2004, Journal of rehabilitation research and development.
[51] Nicu Sebe,et al. Facial expression recognition from video sequences: temporal and static modeling , 2003, Comput. Vis. Image Underst..
[52] Pierre-Yves Oudeyer,et al. The production and recognition of emotions in speech: features and algorithms , 2003, Int. J. Hum. Comput. Stud..
[53] Niels Ole Bernsen. A Reference Model for Output Information in Intelligent Multimedia Presentation Systems , 1996 .
[54] Ted Selker,et al. Visual Attentive Interfaces , 2004 .
[55] Samy Bengio,et al. Automatic analysis of multimodal group actions in meetings , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[56] Nicu Sebe,et al. Human-centered computing: a multimedia perspective , 2006, MM '06.
[57] Sharon L. Oviatt,et al. Perceptual user interfaces: multimodal interfaces that process what comes naturally , 2000, CACM.
[58] Philip R. Cohen,et al. The role of voice in human-machine communication , 1994 .
[59] Patrizia Paggio. Multimodal Communication in the Virtual Farm of the Staging Project , 2001 .
[60] Yasuyuki Kono,et al. Real World Objects as Media for Augmenting Human Memory , 2003 .
[61] Narendra Ahuja,et al. Detecting Faces in Images: A Survey , 2002, IEEE Trans. Pattern Anal. Mach. Intell..
[62] Niels Ole Bernsen,et al. Foundations of Multimodal Representations: A Taxonomy of Representational Modalities , 1994, Interact. Comput..
[63] Shumin Zhai,et al. Conversing with the user based on eye-gaze patterns , 2005, CHI.
[64] Philip R. Cohen,et al. Multimodal Interfaces That Process What Comes Naturally , 2000 .
[65] Nicu Sebe,et al. Emotion Recognition Based on Joint Visual and Audio Cues , 2006, 18th International Conference on Pattern Recognition (ICPR'06).
[66] Andrew T Duchowski,et al. A breadth-first survey of eye-tracking applications , 2002, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc.
[67] Philip R. Cohen,et al. Tangible multimodal interfaces for safety-critical applications , 2004, CACM.
[68] Tieniu Tan,et al. A survey on visual surveillance of object motion and behaviors , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).
[69] Vladimir Pavlovic,et al. Boosted learning in dynamic Bayesian networks for multimodal speaker detection , 2003, Proc. IEEE.
[70] Larry S. Davis,et al. Human expression recognition from motion using a radial basis function network architecture , 1996, IEEE Trans. Neural Networks.
[71] Ephraim P. Glinert,et al. Multimodal Integration , 1996, IEEE Multim..
[72] Dariu Gavrila,et al. The Visual Analysis of Human Movement: A Survey , 1999, Comput. Vis. Image Underst..
[73] Colin Potts,et al. Design of Everyday Things , 1988 .
[74] A. Martinez,et al. Face image retrieval using HMMs , 1999, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL'99).
[75] Gang Hua,et al. Tracking articulated body by dynamic Markov network , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[76] Pierre-Yves Oudeyer,et al. The production and recognition of emotions in speech: features and algorithms , 2003 .
[77] Zhihong Zeng,et al. Bimodal HCI-related affect recognition , 2004, ICMI '04.
[78] Hatice Gunes,et al. Face and Body Gesture Recognition for a Vision-Based Multimodal Analyzer , 2004, VIP.
[79] Jeffrey S. Shell,et al. Hands on cooking: towards an attentive kitchen , 2003, CHI Extended Abstracts.
[80] Jing Xiao,et al. Meticulously detailed eye region model and its application to analysis of facial images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[81] Antti Oulasvirta,et al. A cognitive meta-analysis of design approaches to interruptions in intelligent environments , 2004, CHI EA '04.
[82] Alex Pentland,et al. Coding, Analysis, Interpretation, and Recognition of Facial Expressions , 1997, IEEE Trans. Pattern Anal. Mach. Intell..
[83] Tieniu Tan,et al. Recent developments in human motion analysis , 2003, Pattern Recognit..
[84] Allen Newell,et al. The psychology of human-computer interaction , 1983 .
[85] Vladimir Pavlovic,et al. Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review , 1997, IEEE Trans. Pattern Anal. Mach. Intell..
[86] Thomas S. Huang,et al. Modeling video using input/output Markov models with application to multi-modal event detection , 2003 .
[87] Marvin Minsky,et al. A framework for representing knowledge , in The Psychology of Computer Vision , 1975 .
[88] Stan Sclaroff,et al. Automatic 2D Hand Tracking in Video Sequences , 2005, 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION'05) - Volume 1.
[89] Björn W. Schuller,et al. Multimodal emotion recognition in audiovisual communication , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.
[90] Harry Wechsler,et al. Using Eye Region Biometrics to Reveal Affective and Cognitive States , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.
[91] Dirk Heylen,et al. Multimodal Communication in Inhabited Virtual Environments , 2002, Int. J. Speech Technol..
[92] Rajeev Sharma,et al. Experimental evaluation of vision and speech based multimodal interfaces , 2001, PUI '01.
[93] Mubarak Shah,et al. Determining driver visual attention with one camera , 2003, IEEE Trans. Intell. Transp. Syst..
[94] Christine L. Lisetti,et al. Modeling Multimodal Expression of User’s Affective Subjective Experience , 2002, User Modeling and User-Adapted Interaction.
[95] Jianyi Liu,et al. Hotspot Components for Gesture-Based Interaction , 2005, INTERACT.
[96] Mark Weiser,et al. Some computer science issues in ubiquitous computing , 1993, CACM.
[97] Sharon L. Oviatt,et al. Ten myths of multimodal interaction , 1999, Commun. ACM.
[98] Trevor Darrell,et al. Multimodal Interfaces That Flex, Adapt, and Persist , 2004 .
[99] Pat Langley,et al. User modeling in adaptive interfaces , 1999 .
[100] Sharon L. Oviatt,et al. Individual differences in multimodal integration patterns: what are they and why do they exist? , 2005, CHI.
[101] Larry S. Davis,et al. Recognizing Human Facial Expressions From Long Image Sequences Using Optical Flow , 1996, IEEE Trans. Pattern Anal. Mach. Intell..
[102] Mubarak Shah,et al. Ontology and taxonomy collaborated framework for meeting classification , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..
[103] Kostas Karpouzis,et al. Emotion Analysis in Man-Machine Interaction Systems , 2004, MLMI.
[104] Michael Strube,et al. Architecture and implementation of multimodal plug and play , 2003, ICMI '03.
[105] Nicu Sebe,et al. Multimodal Human Computer Interaction: A Survey , 2005, ICCV-HCI.
[106] Matthew Turk,et al. Perceptual user interfaces (introduction) , 2000, CACM.
[107] Katie Salen,et al. Rules of play: game design fundamentals , 2003 .
[108] Kent Larson,et al. A living laboratory for the design and evaluation of ubiquitous computing technologies , 2005, CHI Extended Abstracts.
[109] Philip R. Cohen,et al. QuickSet: multimodal interaction for distributed applications , 1997, MULTIMEDIA '97.
[110] Andry Rakotonirainy,et al. A Survey of Research on Context-Aware Homes , 2003, ACSW.
[111] J. Lien,et al. Automatic recognition of facial expressions using hidden markov models and estimation of expression intensity , 1998 .
[112] Alex Pentland,et al. Perceptual user interfaces: perceptual intelligence , 2000, CACM.
[113] Kosuke Sato,et al. Real-time gesture recognition by learning and selective control of visual interest points , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[114] Ben Shneiderman,et al. Leonardo's laptop: human needs and the new computing technologies , 2005, CIKM '05.
[115] L. Paletta,et al. Mobile Vision for Ambient Learning in Urban Environments , 2004 .
[116] Yoshiaki Shirai,et al. Look where you're going [robotic wheelchair] , 2003, IEEE Robotics Autom. Mag..
[117] Jacob Eisenstein,et al. Building the Design Studio of the Future , 2004, AAAI Technical Report.
[118] Ramesh C. Jain,et al. Folk computing , 2002, CACM.
[119] Jeff B. Pelz. Portable eyetracking in natural behavior , 2004 .
[120] Oliviero Stock,et al. Multimodal intelligent information presentation , 2005 .
[121] L. C. De Silva,et al. Bimodal emotion recognition , 2000, Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580).
[122] Biing-Hwang Juang,et al. Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.
[123] Mohammed Yeasin,et al. Speech-gesture driven multimodal interfaces for crisis management , 2003, Proc. IEEE.
[124] Margrit Betke,et al. Evaluation of tracking methods for human-computer interaction , 2002, Sixth IEEE Workshop on Applications of Computer Vision, 2002. (WACV 2002). Proceedings..
[125] Lawrence S. Chen,et al. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction , 2000 .
[126] Alejandro Jaimes. Human-centered multimedia: culture, deployment, and access , 2006, IEEE Multimedia.
[127] Michael Johnston,et al. Multimodal Applications from Mobile to Kiosk , 2004 .
[128] P. Ekman. Emotion in the human face , 1982 .
[129] Hidekazu Yoshikawa. Modeling humans in human-computer interaction , 2002 .
[130] Abderrahmane Kheddar,et al. Tactile interfaces: a state-of-the-art survey , 2004 .
[131] Iain R. Murray,et al. Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. , 1993, The Journal of the Acoustical Society of America.
[132] Marko Balabanovic,et al. Exploring Versus Exploiting when Learning User Models for Text Recommendation , 2004, User Modeling and User-Adapted Interaction.
[133] James A. Larson,et al. Guidelines for multimodal user interface design , 2004, CACM.
[134] Bob Carpenter,et al. The logic of typed feature structures , 1992 .
[135] James L. Flanagan,et al. Multimodal interaction on PDA's integrating speech and pen inputs , 2003, INTERSPEECH.
[136] P. Lang. The emotion probe. Studies of motivation and attention. , 1995, The American psychologist.
[137] Sharon L. Oviatt,et al. Mutual disambiguation of recognition errors in a multimodel architecture , 1999, CHI '99.
[138] Alisa Rudnitskaya,et al. Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie , 2005 .
[139] James M. Rehg,et al. Vision for a smart kiosk , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[140] Martin Kay,et al. Functional Unification Grammar: A Formalism for Machine Translation , 1984, ACL.
[141] S. Demleitner. [Communication without words] , 1997, Pflege aktuell.
[142] Timothy F. Cootes,et al. A unified approach to coding and interpreting face images , 1995, Proceedings of IEEE International Conference on Computer Vision.
[143] Richard A. Volz,et al. Evaluation of a Haptic Mixed Reality System for Interactions with a Virtual Control Panel , 2005, Presence: Teleoperators & Virtual Environments.
[144] Flavia Sparacino,et al. The Museum Wearable: real-time sensor-driven understanding of visitors' interests for personalized visually-augmented museum experiences , 2002 .
[145] Jennifer Healey,et al. Toward Machine Emotional Intelligence: Analysis of Affective Physiological State , 2001, IEEE Trans. Pattern Anal. Mach. Intell..
[146] Robert J. K. Jacob,et al. Evaluation of eye gaze interaction , 2000, CHI.
[147] Rosalind W. Picard. Affective Computing , 1997 .
[148] Adam Cheyer,et al. MVIEWS: multimodal tools for the video analyst , 1998, IUI '98.
[149] Douglas DeCarlo,et al. Robust clustering of eye movement recordings for quantification of visual interest , 2004, ETRA.
[150] Shyamsundar Rajaram,et al. Human Activity Recognition Using Multidimensional Indexing , 2002, IEEE Trans. Pattern Anal. Mach. Intell..
[151] Dana H. Ballard,et al. A multimodal learning interface for grounding spoken language in sensory perceptions , 2004, ACM Trans. Appl. Percept..
[152] Ying Wu,et al. Hand modeling, analysis and recognition , 2001, IEEE Signal Process. Mag..
[153] Matthew Turk,et al. Computer vision in the interface , 2004, CACM.
[154] Yasunari Yoshitomi,et al. Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face , 2000, Proceedings 9th IEEE International Workshop on Robot and Human Interactive Communication. IEEE RO-MAN 2000 (Cat. No.00TH8499).
[155] Dariu Gavrila,et al. Looking at people , 2007, AVSS.
[156] James W. Davis,et al. The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..
[157] M. Turk,et al. Perceptual Interfaces , 2003 .
[158] Niels Ole Bernsen,et al. Multimodality in Language and Speech Systems — From Theory to Design Support Tool , 2002 .
[159] Jakob Nielsen,et al. Noncommand user interfaces , 1993, CACM.
[160] Mohan M. Trivedi,et al. Occupant posture analysis with stereo and thermal infrared video: algorithms and experimental evaluation , 2004, IEEE Transactions on Vehicular Technology.
[161] Philip R. Cohen,et al. Towards a fault-tolerant multi-agent system architecture , 2000, AGENTS '00.
[162] Christian D. Schunn,et al. Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction , 2002, Proc. IEEE.
[163] Marco Porta,et al. Vision-based user interfaces: methods and applications , 2002, Int. J. Hum. Comput. Stud..
[164] Sharon L. Oviatt,et al. Designing the User Interface for Multimodal Speech and Pen-Based Gesture Applications: State-of-the-Art Systems and Future Research Directions , 2000, Hum. Comput. Interact..
[165] Joëlle Coutaz,et al. A design space for multimodal systems: concurrent processing and data fusion , 1993, INTERCHI.
[166] Thomas S. Huang,et al. Exploiting the dependencies in information fusion , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).
[167] Beat Fasel,et al. Automatic facial expression analysis: a survey , 2003, Pattern Recognit..
[168] Stephen A. Brewster,et al. Multimodal 'eyes-free' interaction techniques for wearable devices , 2003, CHI '03.
[169] Paul P. Maglio,et al. A robust algorithm for reading detection , 2001, PUI '01.
[170] Susan R. Fussell,et al. Gestures Over Video Streams to Support Remote Collaboration on Physical Tasks , 2004, Hum. Comput. Interact..
[171] Kenji Mase,et al. Recognition of Facial Expression from Optical Flow , 1991 .
[172] Anthony Jameson,et al. Making systems sensitive to the user's time and working memory constraints , 1998, IUI '99.
[173] Matthew Turk,et al. Multimodal Human-Computer Interaction , 2005 .
[174] A. Adjoudani,et al. On the Integration of Auditory and Visual Parameters in an HMM-based ASR , 1996 .
[175] Alan Hanjalic,et al. Affective video content representation and modeling , 2005, IEEE Transactions on Multimedia.
[176] Sharon L. Oviatt,et al. Unification-based Multimodal Integration , 1997, ACL.
[177] Peter Robinson,et al. Real-Time Inference of Complex Mental States from Facial Expressions and Head Gestures , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.
[178] Ying Wu,et al. Vision-Based Gesture Recognition: A Review , 1999, Gesture Workshop.
[179] Z. Obrenovic,et al. Modeling multimodal human-computer interaction , 2004, Computer.
[180] Thierry Pun,et al. Design and Evaluation of Multimodal System for the Non-visual Exploration of Digital Pictures , 2003, INTERACT.
[181] Monson H. Hayes,et al. Face Recognition Using An Embedded HMM , 1999 .
[182] Anees Shaikh,et al. An Architecture for Multimodal Information Fusion , 1997 .
[183] Nicu Sebe,et al. Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[184] Chun Chen,et al. Audio-visual based emotion recognition - a new approach , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..
[185] Yochai Konig,et al. "Eigenlips" for robust speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.
[186] Alex Pentland,et al. Socially aware, computation and communication , 2005, Computer.