A Study of F0 Modification for X-Vector Based Speech Pseudonymization Across Gender

Speech pseudonymization aims to alter a speech signal so that the identifiable personal characteristics of a given speaker are mapped to another identity. In other words, it aims to hide the source speaker's identity while preserving the intelligibility of the spoken content. This study is conducted within the VoicePrivacy 2020 challenge framework, where the baseline system performs pseudonymization by modifying the x-vector information to match a target speaker while keeping the fundamental frequency (F0) unchanged. We propose to additionally alter a paralinguistic feature, namely the F0, and analyze the impact of this modification across gender. We found that the proposed F0 modification consistently improves pseudonymization, and that both the source and target speakers' genders affect the performance gain obtained when modifying the F0.
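
As a rough illustration, the sketch below shows one plausible way to modify an F0 contour toward a target speaker by linearly matching log-F0 statistics. The abstract does not specify the exact transformation used in the study, so the function `shift_f0` and its mean/variance matching strategy are assumptions for illustration only.

```python
# Hypothetical sketch of a simple F0 modification: the voiced frames of a
# source F0 contour are mapped to a target speaker's log-F0 mean and
# standard deviation. This is an assumed strategy, not the paper's method.
import numpy as np

def shift_f0(f0_source, target_log_mean, target_log_std):
    """Map voiced source F0 values (Hz) toward the target speaker's
    log-F0 statistics; unvoiced frames (F0 == 0) are left unchanged."""
    f0_out = f0_source.copy()
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    src_mean, src_std = log_f0.mean(), log_f0.std()
    # Standardize the source log-F0, then rescale to the target statistics.
    log_f0_shifted = (log_f0 - src_mean) / (src_std + 1e-8) * target_log_std + target_log_mean
    f0_out[voiced] = np.exp(log_f0_shifted)
    return f0_out

# Example: move a male-range contour toward assumed female-range statistics.
f0 = np.array([0.0, 110.0, 120.0, 0.0, 130.0])
print(shift_f0(f0, target_log_mean=np.log(210.0), target_log_std=0.15))
```

In an anonymization pipeline of this kind, such a transformation would be applied to the extracted F0 stream before waveform re-synthesis, alongside the modified x-vector.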
