Speaker de-identification via voice transformation

It is a common feature of modern automated voice-driven applications and services to record and transmit a user's spoken request. At the same time, several domains and applications may require transmitting the content of the user's request while keeping the speaker's identity confidential. This requires a technology that de-identifies the speaker's voice: the transformed voice should sound natural and intelligible but should not reveal who is speaking. In this paper we investigate different voice transformation strategies on a large population of speakers to disguise the speakers' identities while preserving the intelligibility of their voices. To verify the success of de-identification with voice transformation, we apply two automatic speaker identification approaches, one GMM-based and one phonetic. The evaluation with these automatic speaker identification systems confirms that the proposed voice transformation technique enables transmission of the content of the users' spoken requests while successfully protecting their identities. The results also indicate that different speakers still sound distinct after the transformation. Furthermore, we carried out a human listening test, which showed the transformed speech to be both intelligible and securely de-identified: it hid the speakers' identities even from listeners who knew the speakers very well.
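The GMM-based check described above can be illustrated with a minimal sketch, under stated assumptions. It trains one diagonal-covariance Gaussian mixture per enrolled speaker on original speech. It then attributes each test utterance to the speaker whose model gives the highest average per-frame log-likelihood and reports the fraction of utterances not attributed to their true speaker as a de-identification rate. Everything here is hypothetical. The 2-D synthetic "frames" stand in for real spectral features such as MFCCs. `DiagGMM`, `identify`, `deidentification_rate` and `synth` are illustrative names. The "transformation" simply shifts each speaker's frames onto another speaker's region, so it is not the voice transformation studied in the paper.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector (hypothetical stand-in for e.g. an MFCC frame)


def log_gauss_diag(x: Frame, mean: Frame, var: Frame) -> float:
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def logsumexp(values: List[float]) -> float:
    top = max(values)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagGMM:
    """Hypothetical diagonal-covariance GMM trained with plain EM."""

    def __init__(self, k: int = 2, n_iter: int = 20, var_floor: float = 1e-3, seed: int = 0):
        self.k, self.n_iter, self.var_floor = k, n_iter, var_floor
        self.rng = random.Random(seed)

    def _component_logs(self, f: Frame) -> List[float]:
        # log(weight_j) + log N(f; mean_j, var_j) for every component j
        return [math.log(w) + log_gauss_diag(f, m, v)
                for w, m, v in zip(self.weights, self.means, self.vars)]

    def fit(self, frames: List[Frame]) -> "DiagGMM":
        n, dim = len(frames), len(frames[0])
        # Initialise means from random frames, variances from the global variance.
        self.means = [list(f) for f in self.rng.sample(frames, self.k)]
        mu = [sum(f[d] for f in frames) / n for d in range(dim)]
        var = [max(sum((f[d] - mu[d]) ** 2 for f in frames) / n, self.var_floor)
               for d in range(dim)]
        self.vars = [list(var) for _ in range(self.k)]
        self.weights = [1.0 / self.k] * self.k
        for _ in range(self.n_iter):
            # E-step: posterior responsibility of each component for each frame.
            resp = []
            for f in frames:
                logs = self._component_logs(f)
                norm = logsumexp(logs)
                resp.append([math.exp(lg - norm) for lg in logs])
            # M-step: re-estimate weights, means and floored variances.
            for j in range(self.k):
                nj = sum(r[j] for r in resp) + 1e-10
                self.weights[j] = nj / n
                self.means[j] = [sum(r[j] * f[d] for r, f in zip(resp, frames)) / nj
                                 for d in range(dim)]
                self.vars[j] = [max(sum(r[j] * (f[d] - self.means[j][d]) ** 2
                                        for r, f in zip(resp, frames)) / nj, self.var_floor)
                                for d in range(dim)]
        return self

    def avg_log_likelihood(self, frames: List[Frame]) -> float:
        return sum(logsumexp(self._component_logs(f)) for f in frames) / len(frames)


def identify(models: Dict[str, DiagGMM], frames: List[Frame]) -> str:
    """Closed-set identification: the speaker whose GMM scores the frames highest."""
    return max(models, key=lambda spk: models[spk].avg_log_likelihood(frames))


def deidentification_rate(models: Dict[str, DiagGMM],
                          utts: List[Tuple[str, List[Frame]]]) -> float:
    """Fraction of utterances NOT attributed to their true speaker."""
    return sum(identify(models, fr) != spk for spk, fr in utts) / len(utts)


rng = random.Random(42)


def synth(center: Tuple[float, float], n: int) -> List[List[float]]:
    """Synthetic 2-D frames scattered around a speaker-specific center."""
    return [[c + rng.gauss(0.0, 0.3) for c in center] for _ in range(n)]


centers = {"spk_a": (0.0, 0.0), "spk_b": (3.0, 0.0), "spk_c": (0.0, 3.0)}
models = {s: DiagGMM(seed=1).fit(synth(c, 200)) for s, c in centers.items()}
original = [(s, synth(c, 50)) for s, c in centers.items()]
# Toy "transformation": shift every speaker's frames onto spk_b's region.
tx, ty = centers["spk_b"]
transformed = [(s, [[x + tx - centers[s][0], y + ty - centers[s][1]] for x, y in fr])
               for s, fr in original]
print(deidentification_rate(models, original))     # 0.0
print(deidentification_rate(models, transformed))  # 0.6666666666666666
```

With these well-separated synthetic speakers, identification is perfect on the original utterances, while two of the three shifted utterances are attributed to `spk_b`. A real evaluation would also need to check, as the abstract notes, that transformed speakers remain distinguishable from one another. This toy shift deliberately does not preserve that distinction.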
