Online speaker de-identification using voice transformation

Speaker de-identification is the process of transforming speech so that the speaker's identity is masked while the transformed speech retains the acoustic information that contributes to intelligibility, naturalness and clarity. Systems that perform speaker de-identification could be used in voice-driven applications (for example, in call centres) where the speaker's identity has to be hidden. This paper describes the experiments we performed to de-identify speech using GMM-based voice transformation techniques, together with speaker identification using freely available tools. We propose a method by which speakers whose speech was not used to train the voice transformations can be efficiently de-identified online. The proposed method is evaluated on a database of read speech with a small set of speakers. The results show that the proposed method performs comparably to a closed-set de-identification procedure that requires prior enrolment, and that it can be used efficiently for online speaker de-identification.
