Talking-Face Identity Verification, Audiovisual Forgery, and Robustness Issues

The robustness of a biometric identity verification (IV) system is best evaluated by monitoring its behavior under impostor attacks. Such attacks may transform one, several, or all of the biometric modalities. In this paper, we transform both the speech and the visual appearance of a speaker and evaluate the effect on the IV system. We propose MixTrans, a novel mixture-structured bias voice transformation technique that operates in the cepstral domain and allows a transformed audio signal to be estimated and reconstructed in the temporal domain. We also propose a face transformation technique that animates a frontal face image of a client speaker. This technique employs principal warps to deform the defined MPEG-4 facial feature points according to the determined facial animation parameters (FAPs). The robustness of the IV system is evaluated under these attacks.
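To make the face transformation step more concrete, the following is a minimal sketch of a thin-plate spline warp, the interpolant underlying principal warps. It is an illustration under stated assumptions, not the implementation used in the paper: the function names (`tps_kernel`, `fit_tps`, `apply_tps`), the five control points, and the displacement applied to the last point are all hypothetical, and the points do not correspond to actual MPEG-4 feature point indices or FAP values.

```python
import numpy as np

def tps_kernel(r2):
    """Thin-plate spline kernel U(r) = r^2 log(r^2), with U(0) = 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def fit_tps(src, dst):
    """Solve for the spline that maps each src control point onto dst.

    Returns (weights, affine): weights has shape (n, 2) and holds the
    non-affine kernel coefficients; affine has shape (3, 2) and holds
    the coefficients for [1, x, y].
    """
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])
    # Bordered system [[K, P], [P^T, 0]] [W; A] = [dst; 0]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    params = np.linalg.solve(L, rhs)
    return params[:n], params[n:]

def apply_tps(points, src, weights, affine):
    """Warp arbitrary 2-D points with a fitted spline."""
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = tps_kernel(d2)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return P @ affine + U @ weights

# Hypothetical control points: four fixed anchors and one point that moves,
# standing in for a feature point displaced by an animation parameter.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
dst = src.copy()
dst[4] = [0.5, 0.6]

weights, affine = fit_tps(src, dst)
warped = apply_tps(src, src, weights, affine)
```

The fitted spline interpolates the control points exactly, so `warped` matches `dst` up to numerical precision. To animate an image, one would apply `apply_tps` to every pixel coordinate (or to the vertices of a mesh) instead of only to the control points.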
