Speaker information modification in the VoicePrivacy 2020 toolchain

This paper presents a study of the baseline system of the VoicePrivacy 2020 challenge. This baseline relies on a voice conversion system that aims to separate speaker identity from linguistic content in a given speech utterance. To generate an anonymized speech waveform, a neural acoustic model and a neural waveform model combine the extracted linguistic content with a selected pseudo-speaker identity. The linguistic content is estimated with bottleneck features extracted from a triphone classifier, while the speaker information is extracted and then modified to target a pseudo-speaker identity in the x-vector space. In this work, we first propose to replace the triphone-based bottleneck feature extractor, which requires supervised training, with an end-to-end Automatic Speech Recognition (ASR) system. Within this framework, we explore the use of adversarial and semi-adversarial training to learn linguistic features while masking speaker information. Finally, we investigate several anonymization schemes to determine which module benefits most from the generated pseudo-speaker identities.
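The pseudo-speaker selection step lends itself to a short illustration. Below is a minimal sketch of one common strategy for picking a pseudo-speaker identity in the x-vector space: rank a pool of external x-vectors by distance to the source speaker, keep the farthest candidates, and average a random subset of them. The function name, the cosine-distance stand-in (the baseline ranks candidates with PLDA affinities), and the pool-size parameters are illustrative assumptions, not the exact baseline code.

```python
import numpy as np

def select_pseudo_xvector(src_xvec, pool_xvecs, n_far=200, n_avg=100, rng=None):
    """Pick a pseudo-speaker embedding for one source speaker.

    Rank the pool by distance to the source x-vector, keep the n_far
    most distant candidates, randomly draw n_avg of them, and average
    the draw into a single pseudo-speaker x-vector.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Cosine distance stands in here for the baseline's PLDA affinity.
    pool_norm = pool_xvecs / np.linalg.norm(pool_xvecs, axis=1, keepdims=True)
    src_norm = src_xvec / np.linalg.norm(src_xvec)
    dist = 1.0 - pool_norm @ src_norm           # shape: (pool_size,)
    farthest = np.argsort(dist)[-n_far:]        # indices of most distant candidates
    chosen = rng.choice(farthest, size=n_avg, replace=False)
    return pool_xvecs[chosen].mean(axis=0)
```

Averaging a random subset of distant candidates, rather than taking a single far point, keeps the pseudo x-vector inside the plausible region of the embedding space while making it hard to invert back to any one pool speaker.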

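The adversarial training mentioned above is typically implemented with a gradient reversal layer between the ASR encoder and an auxiliary speaker classifier: the forward pass is the identity, while the backward pass flips the gradient so the encoder is pushed toward speaker-invariant features. The following PyTorch sketch shows the mechanism; GradReverse, adversarial_branch, speaker_head, and the scaling factor lam are hypothetical names and values, not the paper's code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lam on backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Flip and scale the gradient flowing back into the encoder.
        return -ctx.lam * grad_out, None

def adversarial_branch(encoder_out, speaker_head, lam=1.0):
    # Training the speaker classifier through the reversed gradient
    # drives the encoder toward features that defeat speaker recognition.
    reversed_feats = GradReverse.apply(encoder_out, lam)
    return speaker_head(reversed_feats)
```

A semi-adversarial variant can be obtained by scheduling or clamping lam during training, so that speaker information is only partially suppressed rather than fully adversarially removed.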