Playing a Part: Speaker Verification at the Movies

The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where actors often intentionally disguise their voices to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore constitutes an entirely different domain from the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drop steeply on this new data, showing the challenge of transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but that considerable room for improvement remains.
