Audio-Visual Biometrics

Biometric characteristics can be used to enable person recognition that is reliable and robust to impostor attacks. Speaker recognition technology is commonly used in systems that enable natural human-computer interaction. The majority of speaker recognition systems rely only on acoustic information and ignore the visual modality. However, visual information is both correlated with and complementary to the audio information, and integrating it into a recognition system can potentially increase the system's performance, especially under adverse acoustic conditions. Acoustic and visual biometric signals, such as a person's voice and face, can be obtained using unobtrusive, user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.


[67]  Louis D. Braida,et al.  Evaluating the articulation index for auditory-visual input. , 1987, The Journal of the Acoustical Society of America.

[68]  J. Flanagan Speech Analysis, Synthesis and Perception , 1971 .

[69]  E. Mayoraz,et al.  Fusion of face and speech data for person identity verification , 1999, IEEE Trans. Neural Networks.

[70]  Ioannis Pitas,et al.  A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications , 2002, EURASIP J. Adv. Signal Process..

[71]  Ren C. Luo,et al.  Multisensor integration and fusion for intelligent machines and systems , 1995 .

[72]  Kuldip K. Paliwal,et al.  Identity verification using speech and face information , 2004, Digit. Signal Process..

[73]  Q Summerfield,et al.  Use of Visual Information for Phonetic Perception , 1979, Phonetica.

[74]  Vlasta Radová,et al.  An approach to speaker identification using multiple classifiers , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[75]  Aggelos K. Katsaggelos,et al.  Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[76]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[77]  Aggelos K. Katsaggelos,et al.  10.8 – Exploiting Visual Information in Automatic Speech Processing , 2005 .

[78]  David J. Kriegman,et al.  Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection , 1996, ECCV.

[79]  Norman Poh,et al.  Automated Authentication Using Hybrid Biometric System , 2002 .

[80]  Kuldip K. Paliwal,et al.  Noise compensation in a person verification system using face and multiple speech feature , 2003, Pattern Recognit..

[81]  Hani Yehia,et al.  Measuring the relation between speech acoustics and 2D facial motion , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[82]  Alan L. Yuille,et al.  Feature extraction from faces using deformable templates , 2004, International Journal of Computer Vision.

[83]  John H. L. Hansen,et al.  Environmental Sniffing: Noise Knowledge Estimation for Robust Speech Systems , 2007, IEEE Trans. Speech Audio Process..

[84]  A. Murat Tekalp,et al.  Multimodal Speaker Identification Using Canonical Correlation Analysis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[85]  Takeo Kanade,et al.  Rotation Invariant Neural Network-Based Face Detection , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[86]  Eric David Petajan,et al.  Automatic Lipreading to Enhance Speech Recognition (Speech Reading) , 1984 .

[87]  James Llinas,et al.  Multisensor Data Fusion , 1990 .

[88]  Arun Ross,et al.  An introduction to biometric recognition , 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[89]  Shaogang Gong,et al.  Audio- and Video-based Biometric Person Authentication , 1997, Lecture Notes in Computer Science.

[90]  David J. Kriegman,et al.  Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection , 1996, ECCV.

[91]  Azriel Rosenfeld,et al.  Face recognition: A literature survey , 2003, CSUR.

[92]  Hans Peter Graf,et al.  Robust recognition of faces and facial features with a multi-modal system , 1997, 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation.

[93]  Juergen Luettin,et al.  Audio-Visual Speech Modeling for Continuous Speech Recognition , 2000, IEEE Trans. Multim..

[94]  Javier R. Movellan,et al.  Visual Speech Recognition with Stochastic Networks , 1994, NIPS.

[95]  Timothy F. Cootes,et al.  Active Appearance Models , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[96]  Richard B. Reilly,et al.  Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features , 2003, AVBPA.

[97]  Jean-Philippe Thiran,et al.  The BANCA Database and Evaluation Protocol , 2003, AVBPA.

[98]  Roberto Brunelli,et al.  Person identification using multiple cues , 1995, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[99]  Samy Bengio,et al.  On transforming statistical models for non-frontal face verification , 2006, Pattern Recognit..

[100]  Tsuhan Chen,et al.  Audio-visual integration in multimodal communication , 1998, Proc. IEEE.

[101]  Michael Wagner,et al.  "liveness" Verification in Audio-video Authentication , 2004, INTERSPEECH.

[102]  Ioannis Pitas,et al.  Multimodal decision-level fusion for person authentication , 1999, IEEE Trans. Syst. Man Cybern. Part A.

[103]  A.M. Tekalp,et al.  Multimodal speaker identification with audio-video processing , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[104]  R. Campbell,et al.  Hearing by eye 2 : advances in the psychology of speechreading and auditory-visual speech , 1997 .

[105]  Farzin Deravi,et al.  Audio-visual person recognition: an evaluation of data fusion strategies , 1997 .

[106]  Samy Bengio,et al.  The Expected Performance Curve , 2003, ICML 2003.

[107]  Jon Barker,et al.  Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models , 1999, AVSP.

[108]  Juergen Luettin,et al.  Visual Speech and Speaker Recognition , 1997 .

[109]  Anil K. Jain,et al.  Integrating Faces and Fingerprints for Personal Identification , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[110]  Demetri Terzopoulos,et al.  Snakes: Active contour models , 2004, International Journal of Computer Vision.

[111]  Juergen Luettin,et al.  A comparison of model and transform-based visual features for audio-visual LVCSR , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[112]  Oscar N. Garcia,et al.  Rationale for Phoneme-Viseme Mapping and Feature Selection in Visual Speech Recognition , 1996 .

[113]  Kevin Barraclough,et al.  I and i , 2001, BMJ : British Medical Journal.

[114]  Farzin Deravi,et al.  A review of speech-based bimodal recognition , 2002, IEEE Trans. Multim..

[115]  Aggelos K. Katsaggelos,et al.  Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features , 2002, EURASIP J. Adv. Signal Process..

[116]  J.N. Gowdy,et al.  CUAVE: A new audio-visual database for multimodal human-computer interface research , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[117]  Sridha Sridharan,et al.  Robust speaker verification via fusion of speech and lip modalities , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[118]  Erik Hjelmås,et al.  Face Detection: A Survey , 2001, Comput. Vis. Image Underst..

[119]  Q. Summerfield Some preliminaries to a comprehensive account of audio-visual speech perception. , 1987 .

[120]  Luc Vandendorpe,et al.  The M2VTS Multimodal Face Database (Release 1.00) , 1997, AVBPA.

[121]  John D. Woodward,et al.  Biometrics: privacy's foe or privacy's friend? , 1997, Proc. IEEE.

[122]  Ralph Gross,et al.  Robust Automatic Human Identification Using Face, Mouth, and Acoustic Information , 2005, AMFG.

[123]  Douglas A. Reynolds,et al.  The NIST speaker recognition evaluation - Overview, methodology, systems, results, perspective , 2000, Speech Commun..

[124]  David G. Stork,et al.  Visionary Speech: Looking Ahead to Practical Speechreading Systems , 1996 .

[125]  Tsuhan Chen,et al.  Audiovisual speech processing , 2001, IEEE Signal Process. Mag..

[126]  Josef Kittler,et al.  Fusion of multiple experts in multimodal biometric personal identity verification systems , 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[127]  P. Jonathon Phillips,et al.  Face recognition based on frontal views generated from non-frontal images , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[128]  Ming Liu,et al.  AVICAR: audio-visual speech corpus in a car environment , 2004, INTERSPEECH.

[129]  Arun Ross,et al.  Information fusion in biometrics , 2003, Pattern Recognit. Lett..

[130]  Stefan Fischer,et al.  Expert Conciliation for Multi Modal Person Authentication Systems by Bayesian Statistics , 1997, AVBPA.

[131]  Hani Yehia,et al.  Quantitative association of vocal-tract and facial behavior , 1998, Speech Commun..

[132]  Seong G. Kong,et al.  Recent advances in visual and infrared face recognition - a review , 2005, Comput. Vis. Image Underst..

[133]  Sridha Sridharan,et al.  The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMMs , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[134]  Samy Bengio Multimodal speech processing using asynchronous Hidden Markov Models , 2004, Inf. Fusion.

[135]  Richard Lippmann,et al.  Speech recognition by machines and humans , 1997, Speech Commun..