Multimodal Speaker Identification Using Canonical Correlation Analysis

In this work, we explore the use of canonical correlation analysis (CCA) to improve the performance of multimodal recognition systems that involve multiple correlated modalities. Specifically, we consider the audio-visual speaker identification problem, where speech and lip texture (intensity) modalities are fused in an open-set identification framework. Our motivation is based on the following observation: the late integration strategy, also referred to as decision or opinion fusion, is effective especially when the contributing modalities are uncorrelated, so that the resulting partial decisions are statistically independent. Early integration techniques, on the other hand, are favored when the modalities are highly correlated. However, coupled modalities such as audio and lip texture also contain components that are mutually independent. We therefore first perform a cross-correlation analysis on the audio and lip modalities to extract the correlated part of the information, and then employ an optimal combination of early and late integration techniques to fuse the extracted features. Experimental results evaluating the performance of the proposed system are also provided.
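The cross-correlation analysis step described above can be sketched with classical CCA: given paired audio and lip feature matrices, CCA finds projection bases for each modality such that the projected components are maximally correlated. The following is a minimal numpy-only sketch under stated assumptions; the variable names, dimensions, and regularization constant are illustrative and not taken from the paper.

```python
# Minimal CCA sketch (numpy only). Feature names and dimensions are
# illustrative assumptions, not the paper's actual feature extraction.
import numpy as np

def cca(X, Y, reg=1e-8):
    """Canonical correlation analysis between two feature matrices.

    X : (n_samples, p) array, e.g. audio features
    Y : (n_samples, q) array, e.g. lip texture features
    Returns the canonical correlations (descending) and the two
    projection bases A (p x k) and B (q x k), k = min(p, q).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each modality
    Yc = Y - Y.mean(axis=0)
    # Regularized auto-covariances and the cross-covariance
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition (C symmetric PD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # SVD of the whitened cross-covariance yields the canonical structure:
    # singular values are the canonical correlations.
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = inv_sqrt(Cxx) @ U            # projection basis for modality X
    B = inv_sqrt(Cyy) @ Vt.T         # projection basis for modality Y
    return s, A, B
```

In this framing, the leading projected components `Xc @ A[:, :k]` and `Yc @ B[:, :k]` would carry the correlated part of the information and feed an early-integration (feature fusion) branch, while the remaining, weakly correlated components are better served by late (decision) fusion.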
