Deviation measure of waveform symmetry and its application to high-speed and temporally-fine F0 extraction for vocal sound texture manipulation

A simple and high-speed F0 extractor with high temporal resolution is proposed based on a waveform symmetry measure. Strictly speaking, it is not an F0 extractor. Instead, it is a detector of the lowest prominent sinusoidal component with a salience measure. It can make use of an F0 refinement procedure, when the signal under investigation is a sum of harmonic sinusoidal components. The refinement procedure is based on a stable representation of instantaneous frequency of periodic signals. Application of the proposed algorithm revealed that rapid temporal modulations in both F0 trajectory and spectral envelope exist typically in expressive voices such as lively singing performance. Manipulation of these temporal fine structures (texture) effectively modified perceptual expressiveness, while somewhat preserving perceptual vocal effort and register.

[1]  HIDEKI KAWAHARA,et al.  Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework , 2011 .

[2]  Brian R Glasberg,et al.  Derivation of auditory filter shapes from notched-noise data , 1990, Hearing Research.

[3]  Eric Moulines,et al.  Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[4]  Axel Röbel,et al.  Applying improved spectral modeling for High Quality voice conversion , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Brad H Story,et al.  Comparison of magnetic resonance imaging-based vocal tract area functions obtained from the same speaker in 1994 and 2002. , 2008, The Journal of the Acoustical Society of America.

[6]  Hideki Kawahara,et al.  Analysis and synthesis of strong vocal expressions: Extension and application of audio texture features to singing voice , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  A. Nuttall Some windows with very good sidelobe behavior , 1981 .

[8]  Hideki Kawahara,et al.  Evaluation and optimization of F0-adaptive spectral envelope estimation based on spectral smoothing with peak emphasis , 2010 .

[9]  F. Harris On the use of windows for harmonic analysis with the discrete Fourier transform , 1978, Proceedings of the IEEE.

[10]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[11]  Hideki Kawahara,et al.  An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Ken-Ichi Sakakibara,et al.  Growl Voice in Ethnic and Pop Styles , 2004 .

[13]  I. Titze Nonlinear source-filter coupling in phonation: theory. , 2008, The Journal of the Acoustical Society of America.

[14]  Hideki Kawahara,et al.  Tandem-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Hideki Kawahara,et al.  STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds , 2006 .

[16]  D G Childers,et al.  Modeling the glottal volume-velocity waveform for three voice types. , 1995, The Journal of the Acoustical Society of America.

[17]  Hideki Kawahara,et al.  YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[18]  J. C. Williams,et al.  Noh voice quality , 2009, Logopedics, phoniatrics, vocology.

[19]  M. Unser Sampling-50 years after Shannon , 2000, Proceedings of the IEEE.

[20]  John G Harris,et al.  A sawtooth waveform inspired pitch estimator for speech and music. , 2008, The Journal of the Acoustical Society of America.

[21]  Bayya Yegnanarayana,et al.  Epoch Extraction From Speech Signals , 2008, IEEE Transactions on Audio, Speech, and Language Processing.