Simplified aperiodicity representation for high-quality speech manipulation systems

A simple model for generating the aperiodic components of synthetic speech is introduced. It modifies the lower-frequency representation to improve the voice quality of resynthesized or morphed speech. The new representation is simple enough to allow intuitive manipulation of this quality-related attribute. The model represents the aperiodic component using a sigmoidal function and employs frequency-axis warping in the lower frequency region. It also introduces temporal envelope shapers for the aperiodic components.
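The idea of a sigmoidal aperiodicity curve on a warped frequency axis can be illustrated with a minimal sketch. The abstract does not give the functional forms, so everything here is an assumption made for illustration: the power-law warping below a knee frequency, the logistic sigmoid, and all names and default values (`warp_low_frequencies`, `aperiodicity_ratio`, `knee_hz`, `gamma`, `transition_hz`, `width_hz`) are hypothetical and are not taken from the paper. The temporal envelope shapers are not covered.

```python
import numpy as np


def warp_low_frequencies(freq_hz, knee_hz=1000.0, gamma=0.5):
    """Hypothetical warping: power law below knee_hz, identity above.

    The two branches meet at knee_hz, so the mapping is continuous
    and monotonically increasing for gamma > 0.
    """
    freq_hz = np.asarray(freq_hz, dtype=float)
    low = knee_hz * (np.clip(freq_hz, 0.0, None) / knee_hz) ** gamma
    return np.where(freq_hz < knee_hz, low, freq_hz)


def aperiodicity_ratio(freq_hz, transition_hz=3000.0, width_hz=500.0,
                       knee_hz=1000.0, gamma=0.5):
    """Assumed model: logistic sigmoid evaluated on the warped axis.

    Returns a value in (0, 1) per frequency: near 0 means mostly
    periodic, near 1 means mostly aperiodic (noise-like).
    """
    warped = warp_low_frequencies(freq_hz, knee_hz, gamma)
    return 1.0 / (1.0 + np.exp(-(warped - transition_hz) / width_hz))


# Evaluate the sketch on a coarse frequency grid from 0 to 8 kHz.
freqs = np.linspace(0.0, 8000.0, 9)
ratio = aperiodicity_ratio(freqs)
```

In this sketch the whole curve is controlled by a transition frequency and a width, plus the two warping parameters, which is the kind of small parameter set that makes manual manipulation practical.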
