Sub-band basis spectrum model for pitch-synchronous log-spectrum and phase based on approximation of sparse coding

In this paper, we propose a sub-band basis spectrum model which is a new spectrum representation model based on a linear combination of sub-band basis vectors. We apply sparse coding to the pitch-synchronously analyzed log-spectra. Based on the approximation of the resulting basis, we obtain subband basis vectors with 1-cycle sinusoidal shapes that have mel-scale for lower frequencies and equally spaced scale for higher frequencies. Parameters of the sub-band basis spectrum model representing the log spectrum and the phase spectrum are calculated by fitting the basis to the spectrum. Since the parameters represent the shape of a spectrum, it can be easily used for voice adaptation, interpolation and conversion. Experimental results show that the analysis synthesis speech based on the proposed model is close to original speech and that there is no significant difference between the synthetic speech using analysis-synthesis database and those using original database for unit-fusion based TTS[1].

[1]  Philip J. B. Jackson,et al.  Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech , 2001, IEEE Trans. Speech Audio Process..

[2]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Masatsune Tamura,et al.  Speech synthesis based on the plural unit selection and fusion method using FWF model , 2009, INTERSPEECH.

[4]  Sridhar Krishna Nemala,et al.  Sparse coding for speech recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[6]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[7]  Sadaoki Furui,et al.  A weight estimation method using LDA for multi-band speech recognition , 2006, INTERSPEECH.

[8]  Tatsuya Mizutani,et al.  Concatenative Speech Synthesis Based on the Plural Unit Selection and Fusion Method , 2005, IEICE Trans. Inf. Syst..

[9]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[10]  Yannis Stylianou,et al.  Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification , 1996 .

[11]  Charles L. Lawson,et al.  Solving least squares problems , 1976, Classics in applied mathematics.