A Frequency Domain Approach to ARX-LF Voiced Speech Parameterization and Synthesis

The ARX-LF model interprets voiced speech as an LF derivative glottal pulse exciting an all-pole vocal tract filter, with an additional exogenous residual signal. It fully parameterizes the voice and has been shown to be useful for voice modification. Time domain methods for determining the ARX-LF parameters from speech are very sensitive to the time placement of the analysis frame and are not robust to phase distortion from, e.g., recording equipment; for this reason, a magnitude-only spectral approach to ARX-LF parameterization was recently developed. This paper describes extensions to this frequency domain approach that yield continuous, robust ARX-LF parameters for voiced speech segments. A listening test with 50 participants compared synthetic speech produced by this method with speech produced by a time domain ARX-LF parameterization approach under real and simulated recording conditions; the frequency domain approach was generally preferred.
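
As a rough illustration of the model named in the abstract, an ARX (autoregressive with exogenous input) description of a voiced speech sample is commonly written as below. The symbols are assumptions chosen for this sketch and are not taken from the paper: s(n) is the speech signal, u(n) the LF derivative glottal pulse, a_k the all-pole vocal tract coefficients of order p, b_0 a gain, and e(n) the residual. The sign convention on the a_k terms is likewise an assumption.

```latex
% Time domain: speech as an all-pole filter driven by the LF pulse plus a residual
s(n) = -\sum_{k=1}^{p} a_k \, s(n-k) + b_0 \, u(n) + e(n)

% Equivalent z-domain form, with A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}
S(z) = \frac{b_0 \, U(z) + E(z)}{A(z)}
```

Under this form, a magnitude-only approach would fit |S(e^{j\omega})| and discard phase, which is why it would be insensitive to frame placement and to phase distortion introduced by recording equipment.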
