Automating phonetic measurement: The case of voice onset time

We present an architecture for locating phonetic events accurately in time, and for measuring time differences between nearby events, using Voice Onset Time (VOT) as a case study. Although VOT remains a central concern in the field, phoneticians' VOT measurements generally continue to rely on human judgment. This requires significant labor, makes even large laboratory experiments onerous, and prevents the field from taking full advantage of the millions of hours of digital speech now becoming available. Our algorithm accurately automates VOT measurement, by combining HMM forced alignment for determining approximate stop boundaries with paired burst and voicing onset detectors. Each detector is a frame-level max margin classifier operating on the scale-space projection of a small number of relevant acoustic features. On a large set of clean lab speech, this system has a mean absolute error (relative to human annotation) of only 2.8 ms, with 98% of errors <10 ms. On a subcorpus independently annotated by two of the authors, the system agreed with the two human annotators as well as they agreed with one another (1.49 ms vs 1.50 ms). Promising results on other datasets will be reported. The system will be released as open-source software.
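The pipeline described above (forced alignment for coarse stop boundaries, then frame-level max-margin detectors over a scale-space projection of acoustic features) can be sketched in miniature. This is an illustrative reconstruction, not the authors' released system: the feature columns, the scale set, and the weight vectors `w_burst` and `w_voice` are hypothetical stand-ins for features and classifiers that would be learned from annotated data.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def scale_space(features, scales=(1, 2, 4, 8)):
    """Project per-frame features into scale space by Gaussian smoothing
    along the time axis at several widths, then stacking the results.

    features: array of shape (n_frames, n_feats)
    returns:  array of shape (n_frames, n_feats * len(scales))
    """
    smoothed = [gaussian_filter1d(features, sigma=s, axis=0) for s in scales]
    return np.concatenate(smoothed, axis=1)

def detect_onset(features, weights, bias=0.0):
    """Score every frame with a linear (max-margin-style) classifier and
    return the index of the best-scoring frame as the detected event."""
    scores = scale_space(features) @ weights + bias
    return int(np.argmax(scores))

def measure_vot(features, w_burst, w_voice, frame_ms=1.0):
    """VOT = time of voicing onset minus time of burst onset, in ms.
    `features` is assumed to cover the aligner's approximate stop region."""
    t_burst = detect_onset(features, w_burst)
    t_voice = detect_onset(features, w_voice)
    return (t_voice - t_burst) * frame_ms
```

On a toy input with two feature columns, one spiking at the burst frame and one at the voicing-onset frame, paired weight vectors that each select one column recover the 40 ms gap between the two events; in practice both the features and the weights would come from the training procedure described in the paper.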
