To Investigate the Accuracy of the Vector Quantization Based Transformation Function for Voice Conversion

Voice conversion transforms the speaker characteristics of speech uttered by one speaker, called the source speaker, so as to generate speech having the voice characteristics of a desired speaker, called the target speaker. Various models are used for voice conversion, such as those based on hidden Markov models (HMM), artificial neural networks (ANN), vector quantization (VQ), and dynamic time warping (DTW). The quality of the transformed speech depends upon the accuracy of the transformation function. To obtain an accurate transformation function, the passages spoken by the source and target speakers must be properly aligned. These correspondences are formed by segmenting the spectral vectors of the source and target speakers into clusters using VQ-based clustering, which drastically reduces computation and memory requirements. The objective of this paper is to investigate the effect of VQ-based transformation function estimation on the closeness of the transformed speech to the target speaker's speech.
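The VQ-based mapping described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual system: it assumes frame-aligned source and target spectral vectors, clusters the source frames with a simple k-means (the VQ step), and pairs each source code vector with the mean of the target frames that fell into that cluster. The function names (`train_vq_mapping`, `convert`) and the choice of plain k-means are assumptions for illustration.

```python
import numpy as np

def train_vq_mapping(src, tgt, n_codes=4, n_iter=20, seed=0):
    """Cluster aligned source frames with k-means (VQ) and pair each
    source code vector with the mean of its aligned target frames.

    src, tgt : (n_frames, dim) arrays of frame-aligned spectral vectors.
    Returns (source codebook, target mapping codebook).
    """
    rng = np.random.default_rng(seed)
    # initialise the codebook with randomly chosen source frames
    codebook = src[rng.choice(len(src), n_codes, replace=False)].copy()
    for _ in range(n_iter):
        # assign each source frame to its nearest code vector
        d = np.linalg.norm(src[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each code vector to the centroid of its cluster
        for k in range(n_codes):
            if np.any(labels == k):
                codebook[k] = src[labels == k].mean(axis=0)
    # mapping codebook: mean of the target frames aligned to each cluster
    target_codes = np.vstack([
        tgt[labels == k].mean(axis=0) if np.any(labels == k) else codebook[k]
        for k in range(n_codes)
    ])
    return codebook, target_codes

def convert(frame, codebook, target_codes):
    """Replace a source frame by the target code of its nearest source code."""
    k = np.linalg.norm(codebook - frame, axis=1).argmin()
    return target_codes[k]
```

A real system would convert a sequence of frames this way and resynthesize speech from the mapped spectral envelopes; the hard quantization to one target code per frame is what limits the accuracy of the transformation function that the paper investigates.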
