Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model

Voice models often use frequency limits to split the speech spectrum into two or more voiced/unvoiced frequency bands. However, from the point of view of voice production, the amplitude spectrum of the voiced source decreases smoothly, without any abrupt frequency limit. Multiband models therefore struggle to estimate these limits and, as a consequence, artifacts can degrade the perceived quality. Using a linear frequency basis adapted to the non-stationarities of the speech signal, the Fan Chirp Transform (FChT) has demonstrated harmonicity at frequencies higher than is usually observed with the DFT, which motivates a full-band model. The previously proposed adaptive Quasi-Harmonic Model (aQHM) offers even more flexibility than the FChT by using a non-linear frequency basis. In this paper, exploiting the properties of aQHM, we describe a full-band Adaptive Harmonic Model (aHM), together with detailed descriptions of the corresponding algorithms for estimating harmonics up to the Nyquist frequency. Formal listening tests show that speech reconstructed with aHM is nearly indistinguishable from the original. Experiments with synthetic signals also show that the proposed aHM globally outperforms previous sinusoidal and harmonic models in terms of the precision of the estimated sinusoidal parameters. Such precision is of interest for building higher-level models on top of the sinusoidal parameters, such as spectral envelopes for speech synthesis.
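To make the full-band idea concrete, the following is a minimal sketch of stationary harmonic synthesis in which every harmonic below the Nyquist frequency is kept, i.e. s(t) = Σ_k a_k cos(2πkf0·t + φ_k) for k·f0 < fs/2. This is a generic textbook harmonic model for illustration only, not the paper's adaptive aHM (which additionally tracks time-varying frequency tracks); the function name and the 1/k amplitude decay used in the example are our own assumptions.

```python
import numpy as np

def harmonic_synthesis(f0, amps, phases, fs, duration):
    """Synthesize a stationary harmonic signal with all partials below Nyquist.

    Generic harmonic-model sketch (NOT the paper's adaptive aHM):
    s(t) = sum_k a_k * cos(2*pi*k*f0*t + phi_k), restricted to k*f0 < fs/2.
    """
    t = np.arange(int(fs * duration)) / fs
    nyquist = fs / 2.0
    s = np.zeros_like(t)
    for k, (a, phi) in enumerate(zip(amps, phases), start=1):
        if k * f0 >= nyquist:  # full-band: keep every harmonic up to Nyquist
            break
        s += a * np.cos(2 * np.pi * k * f0 * t + phi)
    return s

# Example: 200 Hz fundamental with a smooth 1/k amplitude decay,
# loosely mimicking the smoothly decreasing voiced-source spectrum.
fs, f0 = 16000, 200.0
K = int(fs / 2 // f0)                    # harmonics that fit below Nyquist
amps = [1.0 / k for k in range(1, K + 1)]
phases = [0.0] * K
s = harmonic_synthesis(f0, amps, phases, fs, 0.05)
```

With zero phases, the partials all align at t = 0, so s[0] equals the sum of the included amplitudes; the 1/k decay keeps the waveform bounded while still exercising the full band.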
