Psychoacoustical evaluation of PSOLA. II. Double-formant stimuli and the role of vocal perturbation.

This article presents the results of listening experiments and psychoacoustical modeling aimed at evaluating the pitch-synchronous overlap-and-add (PSOLA) technique. This technique can be used to modify the pitch and duration of natural speech simultaneously, using simple and efficient time-domain operations on the speech waveform. The first set of experiments tested the ability of subjects to discriminate double-formant stimuli, modified in fundamental frequency using PSOLA, from unmodified stimuli. Of the potential auditory discrimination cues induced by PSOLA, cues from the first formant were found to generally dominate discrimination performance. In the second set of experiments, the influence of vocal perturbation, i.e., jitter and shimmer, on the discriminability of PSOLA-modified single-formant stimuli was determined. The data show that discriminability deteriorates at most modestly in the presence of jitter and shimmer. With the exception of a few conditions, the trends in these data could be replicated using either a modulation-discrimination or an intensity-discrimination model, depending on the formant frequency. As a baseline experiment, detection thresholds for jitter and shimmer were measured. Thresholds for jitter could be replicated using either the modulation-discrimination or the intensity-discrimination model, depending on the (mean) fundamental frequency of the stimuli. The thresholds for shimmer could be accurately predicted for stimuli with a 250-Hz fundamental, but less accurately in the case of a 100-Hz fundamental.
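The time-domain operations behind PSOLA can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `hann` and `psola_pitch_shift` are hypothetical, pitch marks are assumed to be given (one per period), and only the fundamental frequency is changed while the duration is kept roughly constant. Each synthesis pulse copies a Hann-windowed, two-period segment centered on the nearest analysis pitch mark and overlap-adds it at the new pitch spacing.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def psola_pitch_shift(signal, marks, factor):
    """Simplified time-domain PSOLA (illustrative sketch only).

    signal: list of samples.
    marks:  sorted sample indices of pitch marks, one per period
            (assumed to be known; pitch marking is not shown here).
    factor: ratio of new to original fundamental frequency (> 0).
    """
    if factor <= 0 or len(marks) < 2:
        raise ValueError("need factor > 0 and at least two pitch marks")

    out = [0.0] * len(signal)
    t = float(marks[0])  # position of the current synthesis pitch mark
    while t <= marks[-1]:
        # Pick the analysis mark closest to the synthesis mark.
        k = min(range(len(marks)), key=lambda i: abs(marks[i] - t))
        # Local period: distance to the next mark (or previous, at the end).
        if k + 1 < len(marks):
            period = marks[k + 1] - marks[k]
        else:
            period = marks[k] - marks[k - 1]
        window = hann(2 * period + 1)
        center = int(round(t))
        # Overlap-add the windowed two-period segment at the new position.
        for j, w in enumerate(window):
            src = marks[k] - period + j
            dst = center - period + j
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += w * signal[src]
        # Advance by the modified period: shorter period means higher F0.
        t += period / factor
    return out
```

With `factor = 1.0` and evenly spaced marks, the overlapping Hann windows sum to one, so the signal between the first and last mark is reproduced unchanged. For other factors the segments overlap differently, which is the source of the potential discrimination cues that the listening experiments examine. The sketch applies no gain compensation for the changed overlap.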
