Vowel duration measurement using deep neural networks

Vowel durations are most often used in studies addressing specific issues in phonetics. Thus far, such work has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for accurate automatic measurement of vowel duration, where the input to the algorithm is a speech segment that contains one vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we try two deep-network architectures, a convolutional neural network (CNN) and a deep belief network (DBN), and compare their accuracy to that of an HMM-based forced aligner. Results suggest that the CNN is better than the DBN, and that the CNN and the HMM-based forced aligner are comparable in their results, but neither of them yielded the same predictions as models fit to manually annotated data.
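
The abstract does not say how frame-level network outputs are turned into a single duration. As a minimal sketch of one plausible decoding step (an assumption for illustration, not the authors' method), suppose the network emits a per-frame probability that the frame lies inside the vowel. Because a CVC segment contains exactly one vowel, we can pick the single contiguous run of frames that maximizes the summed margin over a threshold and convert its length to milliseconds. The names `vowel_span` and `vowel_duration_ms`, the 0.5 threshold, and the 10 ms frame shift are all hypothetical.

```python
from typing import List, Tuple


def vowel_span(probs: List[float], threshold: float = 0.5) -> Tuple[int, int]:
    """Return (onset, offset) frame indices, offset exclusive.

    Picks the contiguous run of frames that maximizes sum(p - threshold),
    using a maximum-subarray (Kadane) scan. Hypothetical decoding rule.
    """
    best_score = float("-inf")
    best = (0, 0)
    run_score = 0.0
    run_start = 0
    for i, p in enumerate(probs):
        # A run with non-positive score cannot help; restart at this frame.
        if run_score <= 0.0:
            run_score = 0.0
            run_start = i
        run_score += p - threshold
        if run_score > best_score:
            best_score = run_score
            best = (run_start, i + 1)
    return best


def vowel_duration_ms(probs: List[float], frame_shift_ms: float = 10.0) -> float:
    """Vowel duration implied by the selected span (assumed frame shift)."""
    onset, offset = vowel_span(probs)
    return (offset - onset) * frame_shift_ms


# Made-up per-frame vowel probabilities for one CVC token.
example = [0.1, 0.2, 0.9, 0.8, 0.4, 0.9, 0.1, 0.05]
print(vowel_span(example), vowel_duration_ms(example))
```

Selecting one contiguous run, rather than counting every frame above the threshold, keeps a brief dip in probability inside the vowel (the 0.4 frame here) from splitting the measurement into two segments.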
