All Your Voices are Belong to Us: Stealing Voices to Fool Humans and Machines

In this paper, we study voice impersonation attacks that aim to defeat both humans and machines. Equipped with current advances in automated speech synthesis, our attacker can build a very close model of a victim’s voice from only a small number of samples of that voice (e.g., mined from the Internet or recorded in physical proximity to the victim). Specifically, the attacker uses voice morphing techniques to transform its own voice, speaking any arbitrary message, into the victim’s voice. We examine the consequences of this voice impersonation capability in two important applications and contexts: (1) impersonating the victim to a voice-based user authentication system, and (2) mimicking the victim in arbitrary speech contexts (e.g., posting fake samples on the Internet or leaving fake voice messages).
