Automatic voice onset time estimation from reassignment spectra

We describe an algorithm that automatically estimates the voice onset time (VOT) of plosives. The VOT is the delay between the burst onset of a plosive and the start of periodicity in the voiced sound that follows it. Because the VOT depends on factors such as place of articulation and voicing, it can be used to infer these factors. The algorithm operates on the reassignment spectrum of the speech signal, a high-resolution time-frequency representation that simplifies the detection of the acoustic events in a plosive. We evaluate the algorithm on a subset of the TIMIT database by comparing its estimates with manual VOT measurements. On average, the difference is smaller than 10 ms for 76.1% and smaller than 20 ms for 91.4% of the plosive segments. We also report VOT statistics for /b/, /d/, /g/, /p/, /t/ and /k/ and experimentally verify some sources of variability. Finally, to illustrate a possible application, we integrate the automatic VOT estimates as an additional feature in an HMM-based speech recognition system and show a small but statistically significant improvement in phone recognition rate.
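
The tolerance-based agreement figures quoted above (the share of segments whose automatic and manual VOT differ by less than 10 ms or 20 ms) can be illustrated with a minimal sketch. The function name `tolerance_accuracy` and the VOT values below are hypothetical and chosen only for illustration; they are not taken from the paper or from TIMIT, and the sketch does not reproduce any averaging the authors may have applied.

```python
from typing import Sequence


def tolerance_accuracy(
    auto_ms: Sequence[float],
    manual_ms: Sequence[float],
    tolerance_ms: float,
) -> float:
    """Percentage of segments whose automatic and manual VOT
    differ by strictly less than tolerance_ms milliseconds."""
    if len(auto_ms) != len(manual_ms):
        raise ValueError("both sequences must have the same length")
    if len(auto_ms) == 0:
        raise ValueError("at least one segment is required")
    hits = sum(
        1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) < tolerance_ms
    )
    return 100.0 * hits / len(auto_ms)


# Hypothetical VOT values in milliseconds (illustration only).
auto = [12.0, 55.0, 71.0, 18.0, 90.0]
manual = [15.0, 60.0, 95.0, 20.0, 78.0]

within_10 = tolerance_accuracy(auto, manual, 10.0)
within_20 = tolerance_accuracy(auto, manual, 20.0)
print(f"< 10 ms: {within_10:.1f}%   < 20 ms: {within_20:.1f}%")
```

With these made-up values the absolute differences are 3, 5, 24, 2 and 12 ms, so three of five segments fall within 10 ms and four of five within 20 ms.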
