Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation

With the growing interest in speech technologies, numerous Automatic Speaker Verification (ASV) systems are deployed to perform person identification. These systems rely on neural embeddings as speaker representations. However, such representations may contain privacy-sensitive information about the speakers (e.g., age, sex, or ethnicity). In this paper, we introduce the concept of attribute-driven privacy preservation, which enables a person to hide one or a few personal attributes from the authentication component. As a first solution, we define an adversarial autoencoding method that disentangles a given speaker attribute from the neural speaker representation. The proposed approach is assessed with a focus on the sex attribute. Experiments carried out on the VoxCeleb datasets show that the proposed model enables the manipulation (i.e., variation or hiding) of this attribute while preserving good ASV performance.
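
The abstract does not specify the architecture or the training objective, so the following is only a minimal sketch of one common way to set up adversarial attribute disentanglement in an autoencoder, not the authors' implementation. Everything in it is an assumption made for illustration: the dimensions, the random linear maps standing in for trained networks, the names `encode`, `decode`, `adversary`, and `losses`, the weight `lam`, and the choice of a flipped-label adversarial term. An encoder maps an embedding `x` to a code `z`, a decoder reconstructs `x` from `z` together with a binary attribute `y`, and an adversary tries to predict `y` from `z` alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dim "speaker embedding", 4-dim latent code.
EMB_DIM, LATENT_DIM = 8, 4

# Random linear maps stand in for trained networks (illustration only).
W_enc = rng.normal(scale=0.1, size=(EMB_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM + 1, EMB_DIM))  # +1 input: attribute
w_adv = rng.normal(scale=0.1, size=LATENT_DIM)


def encode(x):
    """Map embeddings to a latent code that should not reveal the attribute."""
    return x @ W_enc


def decode(z, y):
    """Reconstruct embeddings from the code, conditioned on the attribute y."""
    return np.concatenate([z, y[:, None]], axis=1) @ W_dec


def adversary(z):
    """Predict the probability that the attribute equals 1 from the code alone."""
    return 1.0 / (1.0 + np.exp(-(z @ w_adv)))


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def losses(x, y, lam=1.0):
    """Return (reconstruction, adversary, autoencoder) losses for one batch."""
    z = encode(x)
    recon = float(np.mean((x - decode(z, y)) ** 2))
    adv_loss = bce(adversary(z), y)  # the adversary is trained to minimise this
    # The autoencoder is rewarded when the adversary is pushed towards the
    # wrong label (assumed flipped-label formulation).
    ae_loss = recon + lam * bce(adversary(z), 1.0 - y)
    return recon, adv_loss, ae_loss


# Toy batch: 16 random embeddings with random binary attribute labels.
x = rng.normal(size=(16, EMB_DIM))
y = rng.integers(0, 2, size=16).astype(float)
recon, adv_loss, ae_loss = losses(x, y)

# "Variation" of the attribute: decode the same code with the flipped label.
x_flipped = decode(encode(x), 1.0 - y)
```

In an actual training loop the two objectives would be minimised alternately, with the adversary updated on `adv_loss` and the encoder and decoder updated on `ae_loss`. Under these assumptions, feeding the decoder a different attribute value than the true one is what would correspond to varying the attribute.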
