Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants

Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. However, the voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits. Service providers can therefore build a very accurate profile of a user's demographic category and personal preferences, compromising privacy. To address this problem, a privacy-preserving intermediate layer between users and cloud services is proposed to sanitize voice input. It aims to maintain utility while preserving user privacy: it captures speech in real time and analyzes and transforms the signal before the data is shared with service providers. Specifically, sensitive representations are extracted from the raw signal using transformation functions and then masked via voice conversion. An experimental evaluation based on emotion recognition shows that the proposed method reduces identification of the speaker's sensitive emotional state by approximately 96%.
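
To make the sanitization pipeline concrete, below is a minimal sketch of one plausible transformation step, assuming the WORLD vocoder (via the `pyworld` package) as the analysis/synthesis backend and `soundfile` for WAV I/O. The paper's actual transformation functions and voice-conversion model are not specified here, so the F0 flattening shown is only an illustrative stand-in for stripping prosodic (pitch-based) emotional cues before the signal leaves the device.

```python
# Illustrative sketch only: flatten the pitch contour of an utterance so that
# pitch-based emotional cues are suppressed, while keeping the spectral
# envelope (and hence the linguistic content) intact.
import numpy as np
import soundfile as sf   # assumed WAV reader/writer; any equivalent works
import pyworld as pw

def sanitize(in_path: str, out_path: str) -> None:
    x, fs = sf.read(in_path)                   # mono waveform, float64
    x = np.ascontiguousarray(x, dtype=np.float64)

    # Decompose the signal into pitch (F0), spectral envelope, aperiodicity.
    f0, t = pw.dio(x, fs)                      # coarse F0 contour
    f0 = pw.stonemask(x, f0, t, fs)            # refined F0
    sp = pw.cheaptrick(x, f0, t, fs)           # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                  # band aperiodicity

    # Flatten voiced F0 to its median; unvoiced frames (F0 == 0) are left
    # untouched so the resynthesized speech remains intelligible.
    voiced = f0 > 0
    if voiced.any():
        f0[voiced] = np.median(f0[voiced])

    # Resynthesize the "emotionless" waveform from the modified features.
    y = pw.synthesize(f0, sp, ap, fs)
    sf.write(out_path, y, fs)

sanitize("input.wav", "sanitized.wav")
```

In a full system, this feature-level transformation would be followed by a voice-conversion stage that re-wraps the sanitized features in a different vocal identity; the median-flattening above is merely the simplest example of a transformation function operating on the extracted representation.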
