Audio-visual mutual dependency models for biometric liveness checks

In this paper we propose a liveness checking technique for multimodal biometric authentication systems based on audio-visual mutual dependency models. Liveness checking ensures that biometric cues are acquired from a live person who is actually present at the time of capture for authentication. The liveness check based on mutual dependency models is performed by fusing acoustic and visual speech features, which measure the degree of synchrony between the lips and the voice extracted from speaking-face video sequences. Performance evaluation in terms of DET (Detection Error Tradeoff) curves and EERs (Equal Error Rates) on publicly available audio-visual speech databases shows a significant improvement in the performance of the proposed fusion of face-voice features based on mutual dependency models.
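The abstract does not specify how the lip-voice synchrony is scored; a common choice for measuring the mutual dependency between two frame-aligned feature streams is canonical correlation analysis (CCA). The sketch below is an assumption for illustration, not the paper's method: it computes the top canonical correlation between acoustic features (e.g. MFCCs) and visual lip features as a synchrony score, where a live speaking face yields a high score and a replayed or mismatched recording yields a low one. All function and variable names are hypothetical.

```python
import numpy as np

def synchrony_score(audio_feats, video_feats, reg=1e-6):
    """Top canonical correlation between two frame-aligned feature streams.

    audio_feats: (T, d_a) acoustic features, e.g. MFCCs per frame.
    video_feats: (T, d_v) visual lip features for the same T frames.
    Returns a score in [0, 1]; higher suggests stronger audio-visual synchrony.
    (Illustrative sketch only -- the paper's actual model may differ.)
    """
    # Center both streams.
    A = audio_feats - audio_feats.mean(axis=0)
    V = video_feats - video_feats.mean(axis=0)
    T = A.shape[0]

    # Regularized covariance and cross-covariance matrices.
    Caa = A.T @ A / T + reg * np.eye(A.shape[1])
    Cvv = V.T @ V / T + reg * np.eye(V.shape[1])
    Cav = A.T @ V / T

    # Whiten each stream via Cholesky factors; the singular values of the
    # whitened cross-covariance are the canonical correlations.
    Wa = np.linalg.inv(np.linalg.cholesky(Caa))
    Wv = np.linalg.inv(np.linalg.cholesky(Cvv))
    s = np.linalg.svd(Wa @ Cav @ Wv.T, compute_uv=False)
    return float(s[0])
```

A liveness decision would then threshold this score: frame-aligned audio and lip features from a genuine speaking face share a common driving signal and score high, while a photograph with separately replayed audio does not.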
