Automatic measurement of voice onset time using discriminative structured prediction.

A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described; the task is treated as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of arbitrary length containing a voiceless stop and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed around the spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora representing different types of speech. Under several evaluation methods, the algorithm's performance is near human intertranscriber reliability and compares favorably with previous work. Furthermore, performance is minimally affected by training and testing on different corpora, and it remains essentially constant as the amount of training data is reduced to 50 to 250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
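The general idea of training a large-margin structured predictor against a VOT-difference cost can be illustrated with a toy sketch. This is not the authors' implementation: the two-element feature map `phi`, the per-frame "energy" input, the tolerance `eps`, and the passive-aggressive style update are all assumptions chosen only to keep the example small. A candidate output is a pair of frame indices (burst onset, voicing onset), and VOT is their difference in frames.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x, onset, voicing):
    # Hypothetical feature map: rise in the signal at the candidate burst
    # onset and at the candidate voicing onset (real features would be
    # richer spectral and temporal cues).
    return [x[onset] - x[onset - 1], x[voicing] - x[voicing - 1]]

def candidates(x):
    # All (burst onset, voicing onset) pairs with onset before voicing.
    n = len(x)
    return [(b, v) for b in range(1, n) for v in range(b + 1, n)]

def cost(pred, gold, eps=1):
    # Cost is the VOT difference in frames, ignoring errors up to eps
    # (eps is an assumed tolerance).
    return max(0, abs((pred[1] - pred[0]) - (gold[1] - gold[0])) - eps)

def predict(w, x, gold=None):
    # Highest-scoring candidate; if gold is given, the cost is added to the
    # score (cost-augmented inference, used only during training).
    def score(y):
        s = dot(w, phi(x, *y))
        return s + cost(y, gold) if gold is not None else s
    return max(candidates(x), key=score)

def train(data, epochs=5, C=1.0):
    # Passive-aggressive style updates on a structured hinge loss.
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, gold in data:
            y_hat = predict(w, x, gold)
            delta = [g - p for g, p in zip(phi(x, *gold), phi(x, *y_hat))]
            hinge = cost(y_hat, gold) - dot(w, delta)
            norm2 = dot(delta, delta)
            if hinge > 0 and norm2 > 0:
                tau = min(C, hinge / norm2)
                w = [wi + tau * di for wi, di in zip(w, delta)]
    return w

# Toy per-frame energy: silence (0), burst and aspiration (1), voicing (3).
x_train, gold_train = [0, 0, 0, 1, 1, 1, 3, 3, 3], (3, 6)
x_test = [0, 0, 1, 1, 1, 1, 1, 3, 3, 3]

w = train([(x_train, gold_train)])
onset, voicing = predict(w, x_test)
print("predicted VOT (frames):", voicing - onset)
```

The cost depends only on the difference between predicted and labeled VOT, mirroring the training objective described above; the exhaustive search over candidate pairs is feasible here only because the toy segments are a few frames long.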
