Configurable Privacy-Preserving Automatic Speech Recognition

Voice assistive technologies have given rise to far-reaching privacy and security concerns. In this paper we investigate whether modular automatic speech recognition (ASR) can improve privacy in voice assistive systems by combining independently trained separation, recognition, and discretization modules to design configurable privacy-preserving ASR systems. We evaluate privacy concerns and the effects of applying various state-of-the-art techniques at each stage of the system, and report results using task-specific metrics (i.e., WER, ABX, and accuracy). We show that overlapping speech inputs to ASR systems present further privacy concerns, and how these may be mitigated using speech separation and optimization techniques. Our discretization module is shown to minimize paralinguistic privacy leakage from ASR acoustic models to levels commensurate with random guessing. We show that voice privacy can be configurable, and argue this presents new opportunities for privacy-preserving applications incorporating ASR.
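
The abstract describes a pipeline built from independently trained separation, discretization, and recognition modules that can be toggled per deployment. The sketch below is a minimal illustration of that composition, not the authors' implementation: all class, function, and parameter names (e.g. `ConfigurableASR`, `PipelineConfig`, `separator`, `discretizer`, `recognizer`) are hypothetical placeholders standing in for pretrained components.

```python
# Minimal sketch (assumed interfaces, not the paper's code) of composing
# independently trained modules into a configurable privacy-preserving ASR
# pipeline: optional speech separation for overlapping speakers, followed by
# optional discretization to reduce paralinguistic detail, then recognition.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Audio = Sequence[float]   # raw waveform samples
Units = List[int]         # discrete acoustic units (e.g. VQ codebook indices)


@dataclass
class PipelineConfig:
    separate_overlapping: bool = True   # run speech separation first
    discretize_features: bool = True    # strip paralinguistic detail


class ConfigurableASR:
    """Composes separation -> discretization -> recognition modules."""

    def __init__(self,
                 separator: Callable[[Audio], List[Audio]],
                 discretizer: Callable[[Audio], Units],
                 recognizer: Callable[[object], str],
                 config: Optional[PipelineConfig] = None) -> None:
        self.separator = separator
        self.discretizer = discretizer
        self.recognizer = recognizer
        self.config = config or PipelineConfig()

    def transcribe(self, mixture: Audio) -> List[str]:
        # 1) Optionally separate an overlapping mixture into single-speaker streams.
        streams = (self.separator(mixture)
                   if self.config.separate_overlapping else [list(mixture)])
        transcripts = []
        for stream in streams:
            # 2) Optionally map the stream to discrete units so that speaker
            #    and emotion cues are reduced before recognition.
            features = (self.discretizer(stream)
                        if self.config.discretize_features else stream)
            # 3) Recognize text from the (possibly discretized) features.
            transcripts.append(self.recognizer(features))
        return transcripts
```

Under this framing, "configurable privacy" amounts to choosing which modules are enabled for a given application: for instance, enabling both separation and discretization for multi-speaker, privacy-sensitive settings, or disabling discretization when downstream tasks require paralinguistic information.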
