Multimodal fusion for multimedia analysis: a survey

This survey aims to provide multimedia researchers with a state-of-the-art overview of fusion strategies, which are used to combine multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described in terms of their basic concept, advantages, weaknesses, and usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process are highlighted, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection. Finally, we present the open issues for further research in the area of multimodal fusion.
