Privacy-preserving Voice Analysis via Disentangled Representations

Voice User Interfaces (VUIs) are increasingly popular and are built into smartphones, home assistants, and Internet of Things (IoT) devices. Although they offer a convenient, always-on user experience, VUIs raise new security and privacy concerns for their users. In this paper, we focus on attribute inference attacks in the speech domain and demonstrate that an attacker can accurately infer a target user's sensitive and private attributes (e.g., emotion, sex, or health status) from deep acoustic models. To defend against this class of attacks, we design, implement, and evaluate a user-configurable, privacy-aware framework for optimizing speech-related data sharing mechanisms. Our objective is to enable primary tasks such as speech recognition and user identification while removing sensitive attributes from the raw speech data before it is shared with a cloud service provider. We leverage disentangled representation learning to explicitly learn independent factors in the raw data. Guided by the user's preferences, a supervision signal determines which invariant factors are filtered out, while the factors reflected in the selected preference are retained. Our experimental evaluation over five datasets shows that the proposed framework can effectively defend against attribute inference attacks, reducing their success rates to approximately that of random guessing while maintaining accuracy in excess of 99% for the tasks of interest. We conclude that negotiable privacy settings enabled by disentangled representations can bring new opportunities for privacy-preserving applications.
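The preference-driven filtering step can be pictured with a minimal sketch. It is an illustration only, not the framework's implementation: the factor names (`content`, `speaker`, `emotion`), the function `filter_factors`, and the choice to suppress unwanted factors by zeroing them are all assumptions, since the abstract specifies neither the learned factors nor the filtering operation.

```python
from typing import Dict, List, Set

# Hypothetical layout: each disentangled factor is a named vector of latents.
Factors = Dict[str, List[float]]


def filter_factors(factors: Factors, keep: Set[str]) -> Factors:
    """Return a copy in which factors named in `keep` are preserved and
    every other factor is replaced by zeros of the same length.

    Zeroing is an assumed stand-in for "filtering out" a factor.
    """
    unknown = keep - set(factors)
    if unknown:
        raise KeyError(f"unknown factor(s): {sorted(unknown)}")
    return {
        name: list(values) if name in keep else [0.0] * len(values)
        for name, values in factors.items()
    }


# Illustrative latents; a real encoder would produce these from speech.
latents: Factors = {
    "content": [0.2, -1.1, 0.7],
    "speaker": [1.5, 0.3],
    "emotion": [-0.4, 0.9],
}

# A user who only wants speech recognition shares the content factor.
shared = filter_factors(latents, keep={"content"})
print(shared)
```

Under these assumptions, a different preference (for example `keep={"content", "speaker"}`) would retain what user identification needs while still suppressing the emotion factor.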
