Understanding the Tradeoffs in Client-Side Privacy for Speech Recognition

Existing approaches to ensuring the privacy of user speech data focus primarily on the server side. While improving server-side privacy mitigates certain security concerns, users still have no control over whether privacy is preserved on the client side. In this paper, we define, evaluate, and explore techniques for client-side privacy in speech recognition, where the goal is to protect raw speech data before it leaves the client's device. We first formalize several tradeoffs among performance, compute requirements, and privacy that arise in ensuring client-side privacy. Using this tradeoff analysis, we conduct a large-scale empirical study of existing approaches and find that each falls short on at least one metric. Our results call for more research in this crucial area as a step toward safer real-world deployment of speech recognition systems at scale across mobile devices.
