Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis

This paper introduces a general and flexible framework for F0 and aperiodicity (additive non-periodic component) analysis, intended specifically for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: an instantaneous frequency estimator with an initial aperiodicity detector, an F0 trajectory tracker, and an F0 refinement and aperiodicity extractor. A preliminary implementation of the framework substantially outperformed existing F0 extractors in tracking temporally varying F0 trajectories, reducing the RMS F0 estimation error by a factor of 10. The front-end aperiodicity detector consists of a complex-valued wavelet analysis filter with a highly selective temporal and spectral envelope. It uses a new measure that quantifies the deviation from periodicity. This measure is less sensitive to slow FM and AM, and it correlates closely with the signal-to-noise ratio.
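
As a rough illustration of the first subsystem, the sketch below estimates instantaneous frequency from the output of a single complex-valued band-pass filter. It is a minimal sketch under stated assumptions, not the paper's implementation: the Hann envelope, the four-cycle filter length, and the names `complex_bandpass` and `instantaneous_frequency` are all chosen here for illustration, and the abstract does not specify the actual envelope or the aperiodicity measure.

```python
import numpy as np


def complex_bandpass(fc, fs, cycles=4.0):
    """Complex-valued FIR filter centred at fc (Hz).

    Assumption: a Hann envelope spanning `cycles` periods of fc.
    The paper's envelope is described only as "highly selective".
    """
    half = int(round(cycles * fs / fc / 2))
    n = np.arange(-half, half + 1)
    envelope = np.hanning(n.size)
    h = envelope * np.exp(2j * np.pi * fc * n / fs)
    return h / np.sum(envelope)  # unit gain at the centre frequency


def instantaneous_frequency(x, fc, fs):
    """Instantaneous frequency (Hz) of x in the band around fc."""
    y = np.convolve(x, complex_bandpass(fc, fs), mode="same")
    # Phase advance between consecutive samples of the complex output.
    dphi = np.angle(y[1:] * np.conj(y[:-1]))
    return dphi * fs / (2 * np.pi)


if __name__ == "__main__":
    fs = 16000
    t = np.arange(int(0.2 * fs)) / fs
    x = np.sin(2 * np.pi * 120.0 * t)  # synthetic 120 Hz "voiced" signal
    f_inst = instantaneous_frequency(x, fc=110.0, fs=fs)
    centre = f_inst[len(f_inst) // 4: 3 * len(f_inst) // 4]  # skip edges
    print("median instantaneous frequency (Hz):", float(np.median(centre)))
```

Even though the filter is centred at 110 Hz, the phase derivative of its output follows the 120 Hz component that actually falls in the passband. A full front end would presumably run a bank of such filters across the F0 search range and combine their outputs with the aperiodicity measure.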
