Privacy-Preserving Adversarial Representation Learning in ASR: Reality or Illusion?

Automatic speech recognition (ASR) is a key technology in many services and applications, which typically requires user devices to send their speech data to the cloud for ASR decoding. Since the speech signal carries a lot of information about the speaker, this raises serious privacy concerns. As a solution, an encoder residing on each user device can perform local computations to anonymize the representation before it is sent to the cloud. In this paper, we focus on the protection of speaker identity and study the extent to which users can be recognized from the encoded representation of their speech, as obtained by a deep encoder-decoder architecture trained for ASR. Through speaker identification and verification experiments on the Librispeech corpus with open and closed sets of speakers, we show that the representations obtained from a standard architecture still carry a lot of information about speaker identity. We then propose to use adversarial training to learn representations that perform well in ASR while hiding speaker identity. Our results demonstrate that adversarial training dramatically reduces closed-set classification accuracy, but that this does not translate into increased open-set verification error, and hence does not increase the protection of speaker identity in practice. We suggest several possible reasons behind this negative result.
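To make the adversarial training idea concrete, below is a minimal sketch of a shared ASR encoder with an adversarial speaker branch trained through a gradient reversal layer, assuming PyTorch. The architecture, dimensions, module names, and the weight lam are illustrative assumptions for exposition, not the paper's exact setup (which uses a full encoder-decoder ASR model).

```python
# Hypothetical sketch: adversarial suppression of speaker identity in an
# ASR encoder via a gradient reversal layer (GRL). Illustrative only.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the
    backward pass, so the encoder learns to degrade the speaker classifier
    while the classifier itself is still trained to identify speakers."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class AdversarialASRModel(nn.Module):
    def __init__(self, feat_dim=80, hid=256, n_tokens=5000,
                 n_speakers=921, lam=0.5):
        super().__init__()
        self.lam = lam
        # Shared encoder producing the (hopefully anonymized) representation.
        self.encoder = nn.LSTM(feat_dim, hid, num_layers=2, batch_first=True)
        # ASR branch: frame-wise token logits (e.g. for a CTC loss).
        self.asr_head = nn.Linear(hid, n_tokens)
        # Adversarial branch: speaker logits from the mean-pooled encoding.
        self.spk_head = nn.Linear(hid, n_speakers)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                     # (B, T, hid)
        asr_logits = self.asr_head(enc)                  # normal gradient
        pooled = GradReverse.apply(enc, self.lam).mean(dim=1)
        spk_logits = self.spk_head(pooled)               # reversed gradient
        return asr_logits, spk_logits


# Training would minimize asr_loss + speaker cross-entropy; the GRL flips
# the speaker gradient inside the encoder, yielding the adversarial game.
```

The weight lam controls the trade-off: larger values push the encoder harder toward hiding speaker identity, potentially at the cost of ASR accuracy. The paper's negative result suggests that driving closed-set speaker classification accuracy down in this way need not make open-set speaker verification on the same representations any harder.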
