Multilingual Audio-Visual Smartphone Dataset And Evaluation

Smartphones increasingly rely on biometric verification to secure highly sensitive applications. Audio-visual biometrics are gaining popularity because they are easy to use and, owing to their multimodal nature, difficult to spoof. In this work, we present an audio-visual smartphone dataset captured with five different recent smartphones. The dataset contains 103 subjects captured in three sessions that reflect different real-world scenarios. Speech is recorded in three languages so that the language dependency of speaker recognition systems can be studied. These characteristics make the dataset suitable for developing novel unimodal and audio-visual speaker recognition systems. We also report the performance of benchmark biometric verification systems on the dataset. Through extensive experiments, we evaluate the robustness of biometric algorithms to several dependencies, namely signal noise, device, and language, and to presentation attacks such as replay and synthesized signals. The results raise many concerns about the generalization properties of state-of-the-art biometric methods on smartphones.
