Multimodal speaker verification using ancillary known speaker characteristics such as gender or age

Multimodal speaker verification based on easy-to-obtain biometric traits such as face and voice is rapidly gaining acceptance as the preferred technology for many applications. In many of these applications, other characteristics of the speaker, such as gender or age, are known and can be exploited to improve verification accuracy. In this paper we present a parallel approach that determines gender as an ancillary speaker characteristic and incorporates it into the decision of a face-voice speaker verification system. Preliminary experiments with the DaFEx multimodal audio-video database show that fusing the results of gender recognition and identity verification improves the performance of multimodal speaker verification.
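To illustrate the general idea of combining an ancillary gender cue with identity scores, the following is a minimal sketch of weighted-sum score-level fusion. It is not the method used in this paper, whose fusion rule is not specified in the abstract. The function names, the weights, the threshold, and the assumption that all three scores are already normalized to [0, 1] are hypothetical choices made for illustration only.

```python
def fuse_scores(face_score, voice_score, gender_match_prob,
                weights=(0.4, 0.4, 0.2)):
    """Weighted-sum fusion of three scores assumed to lie in [0, 1].

    face_score / voice_score: identity match scores from each modality.
    gender_match_prob: estimated probability that the recognized gender
    agrees with the known gender of the claimed identity.
    The weights are illustrative, not taken from the paper.
    """
    scores = (face_score, voice_score, gender_match_prob)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))


def verify(face_score, voice_score, gender_match_prob, threshold=0.5):
    """Accept the identity claim if the fused score reaches the threshold."""
    fused = fuse_scores(face_score, voice_score, gender_match_prob)
    return fused >= threshold
```

In this sketch, a borderline identity claim (face and voice scores of 0.6 each) is accepted when the gender cue agrees strongly with the claimed identity and rejected when it disagrees, which is the kind of effect the ancillary characteristic is meant to have on the final decision.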
