Speaker Anonymization Using X-vector and Neural Waveform Models

The social media revolution has produced a plethora of web services to which users can easily upload and share multimedia documents. Despite the popularity and convenience of such services, the sharing of this inherently personal data, including speech data, raises obvious security and privacy concerns. In particular, a user's speech data may be acquired and used with speech synthesis systems to produce high-quality speech utterances that reflect the same user's speaker identity. These utterances may then be used to attack speaker verification systems. One solution to mitigate these concerns is to conceal speaker identities before speech data is shared. For this purpose, we present a new approach to speaker anonymization. The idea is to extract linguistic and speaker-identity features from an utterance and then to use them with neural acoustic and waveform models to synthesize anonymized speech. The original speaker identity, in the form of timbre, is suppressed and replaced with that of an anonymous pseudo-identity. The approach exploits state-of-the-art x-vector speaker representations, which are used to derive anonymized pseudo speaker identities by combining multiple, randomly selected speaker x-vectors. Experimental results show that the proposed approach is effective in concealing speaker identities: it increases the equal error rate of a speaker verification system while maintaining high-quality anonymized speech.
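As a minimal sketch of how an anonymized pseudo speaker identity might be derived, the snippet below averages a random subset of x-vectors drawn from an external speaker pool and length-normalizes the result. The pool, the number of candidate x-vectors, and averaging as the combination function are illustrative assumptions rather than the paper's exact recipe, which only specifies that multiple random speaker x-vectors are combined.

```python
import numpy as np


def make_pseudo_xvector(xvector_pool: np.ndarray,
                        n_candidates: int = 10,
                        rng: np.random.Generator = None) -> np.ndarray:
    """Combine several randomly chosen x-vectors into one pseudo-speaker embedding.

    xvector_pool : array of shape (num_pool_speakers, dim) holding x-vectors
                   from an external pool of speakers (assumed setup).
    n_candidates : number of random x-vectors to combine (assumed value).
    """
    rng = rng or np.random.default_rng()
    # Pick a random subset of pool speakers without replacement.
    idx = rng.choice(len(xvector_pool), size=n_candidates, replace=False)
    # Combine the selected x-vectors; simple averaging is one possible choice.
    pseudo = xvector_pool[idx].mean(axis=0)
    # Length-normalize, as is common for x-vectors before PLDA scoring.
    return pseudo / np.linalg.norm(pseudo)
```

In this sketch the resulting pseudo x-vector would then condition the neural acoustic and waveform models in place of the original speaker's x-vector, so the synthesized speech carries the anonymous pseudo-identity rather than the source speaker's timbre.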
