Dr.VOT: Measuring Positive and Negative Voice Onset Time in the Wild

Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When voicing onset precedes burst onset, the VOT is negative; when voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic speech. The proposed system addresses two critical issues: it measures positive and negative VOT equally well, and it is trained to be robust to variation across annotations. Our approach is based on the structured prediction framework, with the feature functions defined as RNNs that learn to capture segmental variation in the signal. Results suggest that our method substantially improves over the current state of the art. In contrast to previous work, our Deep and Robust VOT annotator, Dr.VOT, can successfully estimate negative VOTs while maintaining state-of-the-art performance on positive VOTs. This high level of performance generalizes to new corpora without further retraining.

Index Terms: structured prediction, multi-task learning, adversarial training, recurrent neural networks, sequence segmentation.
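The sign convention in the VOT definition above can be illustrated with a minimal sketch. It assumes the burst onset and voicing onset are already available as timestamps in seconds; the function names `vot_ms` and `vot_sign` are hypothetical and are not part of Dr.VOT, which estimates these onsets from the audio signal itself.

```python
def vot_ms(burst_onset_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds as voicing onset minus burst onset.

    Both arguments are assumed to be onset times in seconds,
    measured from the same reference point in the recording.
    """
    return (voicing_onset_s - burst_onset_s) * 1000.0


def vot_sign(vot: float) -> str:
    """Label a VOT value by its sign."""
    if vot < 0:
        return "negative"  # voicing starts before the burst (prevoicing)
    if vot > 0:
        return "positive"  # voicing starts after the burst
    return "zero"


if __name__ == "__main__":
    # Illustrative timestamps only, not data from the paper.
    for burst, voicing in [(0.100, 0.160), (0.100, 0.020)]:
        value = vot_ms(burst, voicing)
        print(f"burst={burst:.3f}s voicing={voicing:.3f}s "
              f"VOT={value:.1f} ms ({vot_sign(value)})")
```

With this convention, a token whose voicing begins after the burst yields a positive value, and a prevoiced token yields a negative one.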
