Voice conversion versus speaker verification: an overview

A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Recently, a major progress was made in speaker verification which leads to mass market adoption, such as in smartphone and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies provided us a good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques to carry out spoofing attacks; therefore, presents a threat to speaker verification systems. In this paper, we will briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions with a focus on voice conversion spoofing attack. We will also discuss anti-spoofing attack measures for speaker verification.

[1]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[2]  Andreas Stolcke,et al.  Modeling duration patterns for speaker recognition , 2003, INTERSPEECH.

[3]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[4]  Nicholas W. D. Evans,et al.  A new speaker verification spoofing countermeasure based on local binary patterns , 2013, INTERSPEECH.

[5]  Nicholas W. D. Evans,et al.  On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals , 2012, 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO).

[6]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[7]  Yun Lei,et al.  Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Bin Ma,et al.  The RSR2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases , 2012, Interspeech 2012.

[9]  Chng Eng Siong,et al.  Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  H. Zen,et al.  Continuous Stochastic Feature Mapping Based on Trajectory HMMs , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Larry P. Heck,et al.  MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker Recognition Research , 2013 .

[12]  Andreas Stolcke,et al.  Modeling prosodic feature sequences for speaker recognition , 2005, Speech Commun..

[13]  Sébastien Marcel,et al.  On the effectiveness of local binary patterns in face anti-spoofing , 2012, 2012 BIOSIG - Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG).

[14]  Javier Hernando,et al.  Deep belief networks for i-vector based speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Lukás Burget,et al.  Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Keiichi Tokuda,et al.  On the security of HMM-based speaker verification systems against imposture using synthetic speech , 1999, EUROSPEECH.

[17]  Themos Stafylakis,et al.  Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition , 2014, Odyssey.

[18]  Moncef Gabbouj,et al.  Voice Conversion Using Dynamic Kernel Partial Least Squares Regression , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Haizhou Li,et al.  Exemplar-based unit selection for voice conversion utilizing temporal information , 2013, INTERSPEECH.

[20]  Driss Matrouf,et al.  Artificial impostor voice transformation effects on false acceptance rates , 2007, INTERSPEECH.

[21]  Aleksandr Sizov,et al.  Introducing i-vectors for joint anti-spoofing and speaker verification , 2014, INTERSPEECH.

[22]  Haizhou Li,et al.  Synthetic speech detection using temporal modulation feature , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[24]  Steve Renals,et al.  Speaker verification using sequence discriminant support vector machines , 2005, IEEE Transactions on Speech and Audio Processing.

[25]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[26]  Haizhou Li,et al.  Dimension reduction of the modulation spectrogram for speaker verification , 2008, Odyssey.

[27]  Chung-Hsien Wu,et al.  Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  Zhi-Jie Yan,et al.  A Unified Trajectory Tiling Approach to High Quality Speech Rendering , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[29]  Moncef Gabbouj,et al.  On the impact of alignment on voice conversion performance , 2008, INTERSPEECH.

[30]  Simon King,et al.  Transforming F0 contours , 2003, INTERSPEECH.

[31]  Olivier Boëffard,et al.  Pitch and Duration Transformation with Non-parallel Data , 2008 .

[32]  Haizhou Li,et al.  Vulnerability evaluation of speaker verification under voice conversion spoofing: the effect of text constraints , 2013, INTERSPEECH.

[33]  Haizhou Li,et al.  Exemplar-based voice conversion using non-negative spectrogram deconvolution , 2013, SSW.

[34]  John H. L. Hansen,et al.  An experimental study of speaker verification sensitivity to computer voice-altered imposters , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[35]  Matti Pietikäinen,et al.  Face Description with Local Binary Patterns: Application to Face Recognition , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Patrick Kenny,et al.  Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms , 2006 .

[37]  Daniel Erro,et al.  Voice Conversion Based on Weighted Frequency Warping , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[38]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[39]  Li-Rong Dai,et al.  Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[40]  Bin Ma,et al.  Sparse Classifier Fusion for Speaker Verification , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[41]  Seyed Hamidreza Mohammadi,et al.  Transmutative voice conversion , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  H. Ney,et al.  VTLN-based voice conversion , 2003, Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No.03EX795).

[43]  Douglas A. Reynolds,et al.  A Tutorial on Text-Independent Speaker Verification , 2004, EURASIP J. Adv. Signal Process..

[44]  Richard J. Mammone,et al.  Speaker recognition using neural networks and conventional classifiers , 1994, IEEE Trans. Speech Audio Process..

[45]  Hermann Ney,et al.  Text-Independent Voice Conversion Based on Unit Selection , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[46]  Nicholas W. D. Evans,et al.  Spoofing countermeasures for the protection of automatic speaker recognition systems against attacks with artificial signals , 2012, INTERSPEECH.

[47]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[48]  M. Wagner,et al.  Vulnerability of speaker verification to voice mimicking , 2004, Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004..

[49]  Daniel Erro,et al.  INCA Algorithm for Training Voice Conversion Systems From Nonparallel Corpora , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[50]  Andreas Stolcke,et al.  Within-class covariance normalization for SVM-based speaker recognition , 2006, INTERSPEECH.

[51]  Elina Helander,et al.  A Novel Method for Prosody Prediction in Voice Conversion , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[52]  Sébastien Marcel,et al.  Bob: a free signal processing and machine learning toolbox for researchers , 2012, ACM Multimedia.

[53]  Marcos Faúndez-Zanuy,et al.  Speaker verification security improvement by means of speech watermarking , 2006, Speech Commun..

[54]  Tomoki Toda,et al.  A postfilter to modify the modulation spectrum in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[55]  Driss Matrouf,et al.  Effect of Speech Transformation on Impostor Acceptance , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[56]  Bin Ma,et al.  Spoken Language Recognition: From Fundamentals to Practice , 2013, Proceedings of the IEEE.

[57]  Yun Lei,et al.  Application of convolutional neural networks to speaker recognition in noisy conditions , 2014, INTERSPEECH.

[58]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[59]  Steve Young,et al.  The HTK book version 3.4 , 2006 .

[60]  Douglas A. Reynolds,et al.  Modeling prosodic dynamics for speaker recognition , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[61]  Jia Liu,et al.  Voice conversion with smoothed GMM and MAP adaptation , 2003, INTERSPEECH.

[62]  Peng Song,et al.  Voice conversion using support vector regression , 2011 .

[63]  Tomi Kinnunen,et al.  Joint Acoustic-Modulation Frequency for Speaker Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[64]  Moncef Gabbouj,et al.  Voice Conversion Using Partial Least Squares Regression , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[65]  John H. L. Hansen,et al.  Speaker-specific pitch contour modeling and modification , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[66]  Yun Lei,et al.  A deep neural network speaker verification system targeting microphone speech , 2014, INTERSPEECH.

[67]  Pietro Laface,et al.  Fast discriminative speaker verification in the i-vector space , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[68]  Thomas Fang Zheng,et al.  Overview of Front-end Features for Robust Speaker Recognition , 2011 .

[69]  Masanobu Abe,et al.  Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt , 1995, Speech Commun..

[70]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[71]  Inma Hernáez,et al.  Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[72]  Mireia Farrús,et al.  How vulnerable are prosodic features to professional imitators? , 2008, Odyssey.

[73]  Haifeng Li,et al.  Sequence error (SE) minimization training of neural network for voice conversion , 2014, INTERSPEECH.

[74]  Haizhou Li,et al.  Spoofing and countermeasures for speaker verification: A survey , 2015, Speech Commun..

[75]  Gérard Chollet,et al.  Voice forgery using ALISP: indexation in a client memory , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[76]  Haizhou Li,et al.  A study on replay attack and anti-spoofing for text-dependent speaker verification , 2014, Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific.

[77]  William M. Campbell,et al.  Advances in channel compensation for SVM speaker recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[78]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[79]  Kishore Prahallad,et al.  Voice conversion using Artificial Neural Networks , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[80]  Hagai Aronowitz,et al.  Voice transformation-based spoofing of text-dependent speaker verification systems , 2013, INTERSPEECH.

[81]  David S̈undermann,et al.  Voice Conversion Matlab Toolbox , 2007 .

[82]  Olivier Rosec,et al.  Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[83]  Yoshihiko Nankaku,et al.  Simultaneous conversion of duration and spectrum based on statistical models including time-sequence matching , 2008, INTERSPEECH.

[84]  Toby Berger,et al.  Efficient text-independent speaker verification with structural Gaussian mixture models and neural network , 2003, IEEE Trans. Speech Audio Process..

[85]  Patrick Kenny,et al.  Speaker and Session Variability in GMM-Based Speaker Verification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[86]  Thierry Dutoit,et al.  Towards a Voice Conversion System Based on Frame Selection , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[87]  Bayya Yegnanarayana,et al.  Transformation of formants for voice conversion using artificial neural networks , 1995, Speech Commun..

[88]  John H. L. Hansen,et al.  CRSS systems for 2012 NIST Speaker Recognition Evaluation , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[89]  Yoshihiko Nankaku,et al.  Spectral conversion based on statistical models including time-sequence matching , 2007, SSW.

[90]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[91]  Keiichi Tokuda,et al.  Imposture using synthetic speech against speaker verification based on spectrum and pitch , 2000, INTERSPEECH.

[92]  Jr. J.P. Campbell,et al.  Speaker recognition: a tutorial , 1997, Proc. IEEE.

[93]  Nicholas W. D. Evans,et al.  A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns , 2013, 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS).

[94]  Bin Ma,et al.  Text-dependent speaker verification: Classifiers, databases and RSR2015 , 2014, Speech Commun..

[95]  Haizhou Li,et al.  ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition , 2013, INTERSPEECH.

[96]  Tomi Kinnunen,et al.  I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry , 2013, INTERSPEECH.

[97]  Ibon Saratxaga,et al.  Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[98]  Haizhou Li,et al.  A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[99]  Haizhou Li,et al.  Text-independent F0 transformation with non-parallel data for voice conversion , 2010, INTERSPEECH.

[100]  Takao Kobayashi,et al.  Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[101]  R Togneri,et al.  An Overview of Speaker Identification: Accuracy and Robustness Issues , 2011, IEEE Circuits and Systems Magazine.

[102]  Haizhou Li,et al.  Conditional restricted Boltzmann machine for voice conversion , 2013, 2013 IEEE China Summit and International Conference on Signal and Information Processing.

[103]  Keiichi Tokuda,et al.  A robust speaker verification system against imposture using an HMM-based speech synthesis system , 2001, INTERSPEECH.

[104]  Yun Lei,et al.  A novel scheme for speaker recognition using a phonetically-aware deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[105]  Haizhou Li,et al.  Detecting Converted Speech and Natural Speech for anti-Spoofing Attack in Speaker Recognition , 2012, INTERSPEECH.

[106]  Hagai Aronowitz,et al.  New Developments in Voice Biometrics for User Authentication , 2011, INTERSPEECH.

[107]  Zeynep Inanoglu Transforming Pitch in a Voice Conversion Framework , 2003 .

[108]  Mats Blomberg,et al.  Vulnerability in speaker verification - a study of technical impostor techniques , 1999, EUROSPEECH.

[109]  Patrick Kenny,et al.  Modeling Prosodic Features With Joint Factor Analysis for Speaker Verification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[110]  Javier Hernando,et al.  i-Vector Modeling with Deep Belief Networks for Multi-Session Speaker Recognition , 2014, Odyssey.