Audio-Visual Feature Fusion for Speaker Identification

Conventional speaker identification systems analyze facial and audio features separately. We propose a robust algorithm for text-independent speaker identification based on decision-level and feature-level fusion of facial and audio features. The approach uses Mel-frequency cepstral coefficients (MFCCs) for audio signal processing, the Viola-Jones Haar cascade algorithm for face detection in video, and eigenface features (EFF) and Gaussian mixture models (GMMs) for feature-level and decision-level fusion of audio and video. Decision-level fusion combines a PCA-based face classifier with a GMM-based audio classifier through AND voting. Feature-level fusion combines MFCC (audio) and PCA (face) features to construct a hybrid GMM for each speaker. Tests on GRID, a multi-speaker audio-visual database, show that decision-level fusion of PCA (face) and GMM (audio) achieves 98.2 % accuracy and is almost 15 % more efficient than feature-level fusion.
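The two fusion strategies can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the nearest-neighbour matching in PCA space, the precomputed per-speaker GMM log-likelihoods, and the choice to reject a sample when the two modalities disagree are all assumptions made for illustration; the abstract specifies only that decision-level fusion uses AND voting and that feature-level fusion combines MFCC and PCA features.

```python
from typing import Dict, List, Optional, Sequence


def identify_by_face(probe: Sequence[float],
                     templates: Dict[str, Sequence[float]]) -> str:
    """Return the speaker whose PCA (eigenface) template is closest to the probe.

    Assumption: nearest neighbour by squared Euclidean distance.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(templates, key=lambda spk: sq_dist(probe, templates[spk]))


def identify_by_audio(log_likelihoods: Dict[str, float]) -> str:
    """Return the speaker whose GMM gives the highest log-likelihood.

    Assumption: the per-speaker scores were computed elsewhere from MFCC frames.
    """
    return max(log_likelihoods, key=lambda spk: log_likelihoods[spk])


def and_vote(face_id: str, audio_id: str) -> Optional[str]:
    """Decision-level fusion: accept an identity only if both modalities agree.

    Assumption: a disagreement is treated as a rejection (None).
    """
    return face_id if face_id == audio_id else None


def fuse_features(mfcc: Sequence[float], pca: Sequence[float]) -> List[float]:
    """Feature-level fusion: concatenate audio and face features into one vector.

    Such vectors would then be used to train one hybrid GMM per speaker.
    """
    return list(mfcc) + list(pca)


# Hypothetical toy data: two enrolled speakers, one test sample.
face_templates = {"s1": [0.0, 0.0], "s2": [1.0, 1.0]}
audio_scores = {"s1": -120.0, "s2": -95.5}

face_id = identify_by_face([0.9, 1.1], face_templates)
audio_id = identify_by_audio(audio_scores)
decision = and_vote(face_id, audio_id)
```

Under AND voting a wrong identity is accepted only when both classifiers make the same mistake, at the cost of rejecting samples on which they disagree. Feature-level fusion instead relies on a single model trained on the concatenated vectors.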
