Fully automatic face recognition system using a combined audio-visual approach

This paper presents a novel approach to fusing audio and video information that greatly improves the automatic recognition of people in video sequences. Audio and video information are first used independently to obtain confidence values indicating the likelihood that a specific person appears in a video shot. A post-classifier then fuses the audio and visual confidence values. The system has been tested on several news sequences, and the results indicate that the recognition rate improves significantly when both modalities are used together.
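
The two-stage idea described above (per-modality confidence values, then a post-classifier that fuses them) can be illustrated with a minimal sketch. The abstract does not say which post-classifier is used, so the logistic-regression fuser below is an assumption chosen for brevity, not the authors' method. The function names (`train_fusion`, `fuse`, `make_synthetic_shots`) and the synthetic confidence values are likewise hypothetical and stand in for real audio and face recognizer outputs.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_fusion(samples, labels, lr=0.5, epochs=500):
    """Fit (w_audio, w_video, bias) of a logistic post-classifier.

    samples: list of (audio_conf, video_conf) pairs, one per video shot.
    labels:  1 if the target person appears in the shot, else 0.
    Assumption: logistic regression stands in for the unspecified
    post-classifier of the paper.
    """
    w_a, w_v, b = 0.0, 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        g_a = g_v = g_b = 0.0
        for (a, v), y in zip(samples, labels):
            err = sigmoid(w_a * a + w_v * v + b) - y
            g_a += err * a
            g_v += err * v
            g_b += err
        # Batch gradient descent step on the mean log-loss gradient.
        w_a -= lr * g_a / n
        w_v -= lr * g_v / n
        b -= lr * g_b / n
    return w_a, w_v, b


def fuse(weights, audio_conf, video_conf):
    """Return the fused confidence that the person appears in the shot."""
    w_a, w_v, b = weights
    return sigmoid(w_a * audio_conf + w_v * video_conf + b)


def make_synthetic_shots(n, rng):
    """Generate hypothetical per-shot confidences (not real data)."""
    samples, labels = [], []
    for _ in range(n):
        present = rng.random() < 0.5
        mean = 0.7 if present else 0.3
        # Each modality is a noisy, independent confidence in [0, 1].
        a = min(1.0, max(0.0, rng.gauss(mean, 0.2)))
        v = min(1.0, max(0.0, rng.gauss(mean, 0.2)))
        samples.append((a, v))
        labels.append(1 if present else 0)
    return samples, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_shots(400, rng)
weights = train_fusion(train_x, train_y)

# Fused decision for a shot where both modalities agree the person is present.
high = fuse(weights, 0.9, 0.9)
# Fused decision for a shot where both modalities give low confidence.
low = fuse(weights, 0.1, 0.1)
```

A shot would then be labeled with the person when the fused confidence exceeds a chosen threshold such as 0.5. In a real system the threshold and the classifier type would be selected on held-out data.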
