Voice Privacy with Smart Digital Assistants in Educational Settings

The emergence of voice-assistant devices ushers in delightful user experiences not just on the smart home front, but also in diverse educational environments from classrooms to personalized-learning/tutoring. However, the use of voice as an interaction modality also could result in exposure of user’s identity, and hinders the broader adoption of voice interfaces; this is especially important in environments where children are present and their voice privacy needs to be protected. To this end, building on state-of-the-art techniques proposed in the literature, we design and evaluate a practical and efficient framework for voice privacy at the source. The approach combines speaker identification (SID) and speech conversion methods to randomly disguise the identity of users right on the device that records the speech, while ensuring that the transformed utterances of users can still be successfully transcribed by Automatic Speech Recognition (ASR) solutions. We evaluate the ASR performance of the conversion in terms of word error rate and show the promise of this framework in preserving the content of the input speech.

[1]  Kou Tanaka,et al.  Cyclegan-VC2: Improved Cyclegan-based Non-parallel Voice Conversion , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Athanasios Mouchtaris,et al.  Non-parallel training for voice conversion by maximum likelihood constrained adaptation , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Nikola Pavesic,et al.  De-identification for privacy protection in multimedia content: A survey , 2016, Signal Process. Image Commun..

[4]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[5]  Simon Dobrisek,et al.  Speaker de-identification using diphone recognition and speech synthesis , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[6]  Miran Pobar,et al.  Online speaker de-identification using voice transformation , 2014, 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO).

[7]  Tanja Schultz,et al.  Voice convergin: Speaker de-identification by voice transformation , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Tomoki Toda,et al.  The Voice Conversion Challenge 2016 , 2016, INTERSPEECH.

[9]  M. Vijay Venkatesh,et al.  Privacy Protection for Life-log Video , 2007 .

[10]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[11]  Hirokazu Kameoka,et al.  Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks , 2017, ArXiv.

[12]  Tara N. Sainath,et al.  Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Michael Zimmer,et al.  Understanding the Role of Privacy and Trust in Intelligent Personal Assistant Adoption , 2019, iConference.

[14]  37th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2014, Opatija, Croatia, May 26-30, 2014 , 2014, MIPRO.

[15]  Tanja Schultz,et al.  Speaker de-identification via voice transformation , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16]  Bradley Malin,et al.  Preserving privacy by de-identifying face images , 2005, IEEE Transactions on Knowledge and Data Engineering.

[17]  Linlin Chen,et al.  Hidebehind: Enjoy Voice Input with Voiceprint Unclonability and Anonymity , 2018, SenSys.