Prεεch: A System for Privacy-Preserving Speech Transcription

Recent advances in machine learning have made Automatic Speech Recognition (ASR) systems practical and scalable. These systems, however, pose serious privacy threats, as speech is a rich source of sensitive acoustic and textual information. Although offline, open-source ASR eliminates these privacy risks, its transcription performance is inferior to that of cloud-based ASR systems, especially for real-world use cases. In this paper, we propose Prεεch, an end-to-end speech transcription system that lies at an intermediate point on the privacy-utility spectrum: it protects the acoustic features of the speakers' voices and protects the privacy of the textual content, while achieving better performance than offline ASR. Additionally, Prεεch provides several control knobs that allow customizable utility-usability-privacy trade-offs. It relies on cloud-based services to transcribe a speech file after applying a series of privacy-preserving operations on the user's side. We perform a comprehensive evaluation of Prεεch on diverse real-world datasets that demonstrates its effectiveness. Prεεch provides transcriptions at a 2% to 32.25% (mean 17.34%) relative improvement in word error rate over Deep Speech, while fully obfuscating the speakers' voice biometrics and allowing only a differentially private view of the textual content.
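The headline metric above, word error rate (WER), is the word-level edit distance between the ASR hypothesis and the reference transcript, normalized by the reference length. The sketch below is illustrative only (the function names are ours, not from the paper); it also shows how a "relative improvement" figure like the 17.34% mean is computed from two WER values:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative WER improvement of a system over a baseline (e.g. Deep Speech)."""
    return (baseline_wer - system_wer) / baseline_wer
```

For example, `wer("the cat sat", "the cat sat down")` is 1/3 (one insertion over three reference words), and a system at 0.24 WER against a 0.30 baseline is a 20% relative improvement.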

[1] Moustapha Cissé, et al. Fooling End-To-End Speaker Verification With Adversarial Examples, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Doug Downey, et al. Unsupervised named-entity extraction from the Web: An experimental study, 2005, Artif. Intell.

[3] Aaron Roth, et al. The Algorithmic Foundations of Differential Privacy, 2014, Found. Trends Theor. Comput. Sci.

[4] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[5] Sarana Nutanong, et al. A Scalable Framework for Stylometric Analysis Query Processing, 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).

[6] Hao Wang, et al. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training, 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[7] Samy Bengio, et al. Tacotron: Towards End-to-End Speech Synthesis, 2017, INTERSPEECH.

[8] David M. Blei, et al. Probabilistic topic models, 2012, Commun. ACM.

[9] Junichi Yamagishi, et al. SUPERSEDED - CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, 2016.

[10] Mats Blomberg, et al. Vulnerability in speaker verification - a study of technical impostor techniques, 1999, EUROSPEECH.

[11] Minhui Xue, et al. The Audio Auditor: Participant-Level Membership Inference in Internet of Things Voice Services, 2019.

[12] Li-Rong Dai, et al. WaveNet Vocoder with Limited Training Data for Voice Conversion, 2018, INTERSPEECH.

[13] Chao Chen, et al. The Audio Auditor: Participant-Level Membership Inference in Voice-Based IoT, 2019, ArXiv.

[14] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15] Paul Lamere, et al. Sphinx-4: a flexible open source framework for speech recognition, 2004.

[16] Yu Wang, et al. VoiceMask: Anonymize and Sanitize Voice Input on Mobile Devices, 2017, ArXiv.

[17] Assaf Schuster, et al. Data mining with differential privacy, 2010, KDD.

[18] Marc Tommasi, et al. Privacy-Preserving Adversarial Representation Learning in ASR: Reality or Illusion?, 2019, INTERSPEECH.

[19] Catuscia Palamidessi, et al. Broadening the Scope of Differential Privacy Using Metrics, 2013, Privacy Enhancing Technologies.

[20] Sanjeev Khudanpur, et al. Librispeech: An ASR corpus based on public domain audio books, 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Hamed Haddadi, et al. Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants, 2019, ArXiv.

[22] P. Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound, 1993.

[23] Manfred K. Warmuth, et al. The CMU Sphinx-4 Speech Recognition System, 2001.

[24] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.

[25] Micah Sherr, et al. You Talk Too Much: Limiting Privacy Exposure Via Voice Input, 2019, 2019 IEEE Security and Privacy Workshops (SPW).

[26] Haizhou Li, et al. Spoofing and countermeasures for speaker verification: A survey, 2015, Speech Commun.

[27] Gautham J. Mysore, et al. Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech? A Dataset, Insights, and Challenges, 2015, IEEE Signal Processing Letters.

[28] Thomas Hofmann, et al. Probabilistic Latent Semantic Analysis, 1999, UAI.

[29] Ahmad-Reza Sadeghi, et al. VoiceGuard: Secure and Private Speech Processing, 2018, INTERSPEECH.

[30] Tomoki Toda, et al. sprocket: Open-Source Voice Conversion Software, 2018, Odyssey.

[31] Bhiksha Raj, et al. Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise, 2013, IEEE Signal Processing Magazine.

[32] Chong Wang, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015, ICML.

[33] Paul Deléglise, et al. TED-LIUM: an Automatic Speech Recognition dedicated corpus, 2012, LREC.

[34] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.

[35] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[36] Francesco Caltagirone, et al. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces, 2018, ArXiv.

[37] Isabel Trancoso, et al. The GDPR & Speech Data: Reflections of Legal and Technology Communities, First Steps towards a Common Understanding, 2019, INTERSPEECH.

[38] Xiang-Yang Li, et al. Towards Privacy-Preserving Speech Data Publishing, 2018, IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.

[39] Navdeep Jaitly, et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40] Sébastien Marcel, et al. Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNs, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41] Quoc V. Le, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019, INTERSPEECH.

[42] Saeid Safavi, et al. Automatic speaker, age-group and gender identification from children's speech, 2018, Comput. Speech Lang.

[43] Björn W. Schuller, et al. Paralinguistics in speech and language - State-of-the-art and the challenge, 2013, Comput. Speech Lang.

[44] Yoshua Bengio, et al. Char2Wav: End-to-End Speech Synthesis, 2017, ICLR.

[45] Bayya Yegnanarayana, et al. Characterization of Glottal Activity From Speech Signals, 2009, IEEE Signal Processing Letters.

[46] Björn W. Schuller, et al. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge, 2011, Speech Commun.

[47] Björn Schuller, et al. Computational Paralinguistics, 2013.

[48] Quoc V. Le, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[49] Franziska Roesner, et al. Investigating the Computer Security Practices and Needs of Journalists, 2015, USENIX Security Symposium.

[50] M. Davino. Assessing privacy risk in outsourcing, 2004, Journal of AHIMA.

[51] Jonathan G. Fiscus, et al. DARPA TIMIT: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1, 1993.

[52] Benjamin C. M. Fung, et al. Publishing set-valued data via differential privacy, 2011, Proc. VLDB Endow.

[53] Tara N. Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition, 2012.

[54] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.

[55] Patrick Nguyen, et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis, 2018, NeurIPS.

[56] Junichi Yamagishi, et al. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, 2017.

[57] Rafael Valle, et al. Attacking Speaker Recognition With Deep Generative Models, 2018, ArXiv.

[58] Ramesh Nallapati, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, 2009, EMNLP.

[59] Junichi Yamagishi, et al. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods, 2018, Odyssey.