Design of Voice Privacy System using Linear Prediction

A speaker’s identity is the most crucial information exploited (implicitly) by an Automatic Speaker Verification (ASV) system. Numerous attacks can be thwarted simultaneously if the speaker’s identity is kept private. The baseline of the VoicePrivacy 2020 Challenge, held at INTERSPEECH, uses the Linear Prediction (LP) model of speech and the McAdams coefficient to achieve speaker de-identification. This baseline alters only the pole angles, using the McAdams coefficient. However, speech acoustics and digital resonator design show that the pole radius is associated with the various energy losses that occur during speech production, and these losses implicitly carry speaker-specific information. We therefore apply fine-tuned changes to both the pole angle and the pole radius, which yields an 18.98% higher EER on the Vctk-test-com dataset and a 5% lower WER on the Libri-test dataset compared to the baseline. A higher EER means the ASV system finds it harder to verify the speaker, and a lower WER means the speech remains more intelligible, so our approach improves privacy preservation without sacrificing intelligibility. Furthermore, we exploit the relatively poor spectral resolution of female speech to achieve effective anonymization: a gender-based analysis of the results shows that our approach anonymizes female speakers better than male speakers.
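The pole manipulation described above can be illustrated with a minimal sketch on a single second-order resonator, assuming the usual McAdams transformation in which a complex pole's angle φ (in radians) is replaced by φ^α. The function names, the value α = 0.8, and in particular the multiplicative `radius_scale` are hypothetical choices for illustration. The abstract does not state how the radius is changed, so this is not the exact method of the paper.

```python
import cmath
import math


def modify_pole(pole, alpha=0.8, radius_scale=0.98):
    """Shift the angle of a complex pole to angle**alpha and scale its radius.

    alpha and radius_scale are illustrative assumptions, not values
    taken from the paper. Real poles are returned unchanged.
    """
    if abs(pole.imag) < 1e-12:
        return pole
    # Keep the pole strictly inside the unit circle so the filter stays stable.
    radius = min(abs(pole) * radius_scale, 0.999)
    # McAdams-style warping acts on the magnitude of the angle in radians.
    angle = abs(cmath.phase(pole)) ** alpha
    sign = 1.0 if pole.imag > 0 else -1.0
    return cmath.rect(radius, sign * angle)


def resonator_coefficients(pole):
    """Denominator [1, a1, a2] of the section with poles at pole and its conjugate."""
    return [1.0, -2.0 * pole.real, abs(pole) ** 2]


def bandwidth_hz(pole, fs):
    """Approximate 3 dB bandwidth of the resonance set by the pole radius."""
    return -math.log(abs(pole)) * fs / math.pi


if __name__ == "__main__":
    fs = 16000.0                       # assumed sampling rate in Hz
    formant_hz = 500.0                 # assumed resonance frequency in Hz
    pole = cmath.rect(0.97, 2.0 * math.pi * formant_hz / fs)
    new_pole = modify_pole(pole)

    new_freq = abs(cmath.phase(new_pole)) * fs / (2.0 * math.pi)
    print("frequency (Hz):", formant_hz, "->", new_freq)
    print("bandwidth (Hz):", bandwidth_hz(pole, fs), "->", bandwidth_hz(new_pole, fs))
    print("coefficients:", resonator_coefficients(new_pole))
```

In a full LP-based system the same per-pole operation would be applied, frame by frame, to every complex root of the LP polynomial before resynthesis. Because the radius sets the resonance bandwidth, shrinking it widens the formant, which is one way the energy-loss cue mentioned above can be altered alongside the formant position.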
