Voice Transformation: A survey

Voice transformation refers to the various modifications one may apply to the sound produced by a person, whether speaking or singing. It is usually treated as an add-on or external component of speech synthesis systems, since it can create virtual voices in a simple and flexible way. In this paper we review state-of-the-art voice transformation methodology, showing its limitations in producing good speech quality and its current challenges. By addressing the quality issues of current voice transformation algorithms in conjunction with properties of the speech production and speech perception systems, we try to pave the way for more natural voice transformation algorithms in the future. Meeting these challenges will allow voice transformation systems to be applied in important and versatile areas of speech technology, with applications far beyond speech synthesis.
