Paper information - Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis

This paper introduces a general and flexible framework for F0 and aperiodicity (additive non-periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: an instantaneous frequency estimator and initial aperiodicity detector, an F0 trajectory tracker, and an F0 refinement and aperiodicity extractor. A preliminary implementation of the proposed framework substantially outperformed existing F0 extractors (by a factor of 10 in terms of RMS F0 estimation error) in its ability to track temporally varying F0 trajectories. The front-end aperiodicity detector consists of a complex-valued wavelet analysis filter with a highly selective temporal and spectral envelope. This detector uses a new measure that quantifies the deviation from periodicity. The measure is less sensitive to slow FM and AM and closely correlates with the signal-to-noise ratio.
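As a rough illustration of the first subsystem's underlying idea, the sketch below filters a signal with a complex-valued band-pass kernel and reads the instantaneous frequency from the phase advance between neighbouring output samples. It is not the paper's implementation: the Hann envelope, the four-cycle kernel length, and the function names `complex_bandpass` and `instantaneous_frequency` are assumptions made for this example, and the paper's own filter design and aperiodicity measure are not reproduced here.

```python
import numpy as np

def complex_bandpass(fc, fs, cycles=4.0):
    """Complex exponential at fc (Hz) under a Hann envelope (assumed shape)."""
    half = int(round(cycles * fs / fc / 2))
    n = np.arange(-half, half + 1)
    envelope = np.hanning(n.size)
    kernel = envelope * np.exp(2j * np.pi * fc * n / fs)
    # Normalise so a component exactly at fc passes with unit gain.
    return kernel / envelope.sum()

def instantaneous_frequency(x, fc, fs):
    """Instantaneous frequency (Hz) of the band around fc, one value per sample step."""
    y = np.convolve(x, complex_bandpass(fc, fs), mode="same")
    # Phase advance between neighbouring samples, wrapped to (-pi, pi].
    dphi = np.angle(y[1:] * np.conj(y[:-1]))
    return dphi * fs / (2 * np.pi)

# Test signal: a 120 Hz sinusoid analysed by a filter centred at 110 Hz.
fs = 16000
t = np.arange(fs // 4) / fs
x = np.sin(2 * np.pi * 120.0 * t)
f_inst = instantaneous_frequency(x, fc=110.0, fs=fs)

# Skip the filter's edge transients before summarising.
f0_estimate = float(np.median(f_inst[500:-500]))
```

Even though the filter is centred 10 Hz away from the sinusoid, the phase derivative of its output follows the frequency of the dominant component in the band rather than the filter's centre frequency. That property is what makes instantaneous frequency usable as a starting point for F0 estimation.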
