You Talk Too Much: Limiting Privacy Exposure Via Voice Input

Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can in turn be used to generate spoofed audio in his or her voice. Building an accurate voice model requires a corpus of the target's speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks: our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, these methods apply audio processing techniques to remove distinctive voice characteristics, leaving only the information necessary for cloud-based services to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech while still permitting accurate voice recognition.
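The sanitization idea described above can be illustrated with a minimal sketch. The snippet below is not the paper's actual pipeline; it is a toy, dependency-free example in which `resample` and `sanitize` are hypothetical helpers that apply a random pitch perturbation via naive linear-interpolation resampling, so that a speaker's habitual pitch is no longer recoverable from the transmitted audio. A real sanitizer would use vocoder-based transforms that shift pitch without altering tempo.

```python
import math
import random

def resample(samples, factor):
    """Naively resample by linear interpolation. Raising `factor` shifts
    pitch up (and shortens duration) -- a crude stand-in for the
    vocoder-based pitch transforms a practical sanitizer would use."""
    out = []
    pos = 0.0
    n = len(samples)
    while pos < n - 1:
        i = int(pos)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += factor
    return out

def sanitize(samples, rng=random.Random(0)):
    """Apply a random pitch perturbation before upload, so the cloud
    service still receives intelligible speech content but not the
    speaker's characteristic pitch."""
    factor = rng.uniform(0.8, 1.25)  # randomized per utterance
    return resample(samples, factor)

# Toy input: a 440 Hz tone at 16 kHz standing in for a voice sample.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 10)]
safe = sanitize(tone)
```

Because the pitch factor is drawn fresh for each utterance, an adversary who accumulates sanitized uploads cannot average them back to the speaker's true pitch.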
