Paper information - Vowel duration measurement using deep neural networks

Vowel durations are most often utilized in studies addressing specific issues in phonetics. Thus far, such work has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for accurate automatic measurement of vowel duration, where the input to the algorithm is a speech segment that contains one vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we try two deep-network architectures, a convolutional neural network (CNN) and a deep belief network (DBN), and compare their accuracy to that of an HMM-based forced aligner. Results suggest that the CNN is better than the DBN and that the CNN and the HMM-based forced aligner are comparable in their results, but neither of them yielded the same predictions as models fit to manually annotated data.
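
The abstract says the network is trained at the frame level but does not say how frame-level outputs are turned into a single duration. The following is a minimal sketch of one plausible decoding step, not the authors' method: it assumes each frame carries a vowel probability, that frames at or above a threshold of 0.5 count as vowel, that the frame shift is 10 ms, and that the vowel is the longest contiguous run of vowel frames (reasonable for a CVC input with exactly one vowel). The function names, the threshold, and the frame shift are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence, Tuple


def longest_vowel_run(
    frame_probs: Sequence[float], threshold: float = 0.5
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the longest vowel run.

    `end` is exclusive. Returns None if no frame reaches the threshold.
    The threshold value is an assumption, not taken from the paper.
    """
    best: Optional[Tuple[int, int]] = None
    run_start: Optional[int] = None
    n = len(frame_probs)
    # Iterate one step past the end so a run touching the last frame is closed.
    for i in range(n + 1):
        is_vowel = i < n and frame_probs[i] >= threshold
        if is_vowel:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or (i - run_start) > (best[1] - best[0]):
                best = (run_start, i)
            run_start = None
    return best


def vowel_duration_ms(
    frame_probs: Sequence[float],
    frame_shift_ms: float = 10.0,  # assumed frame shift, not from the paper
    threshold: float = 0.5,
) -> Optional[Tuple[float, float, float]]:
    """Return (onset_ms, offset_ms, duration_ms) for the longest vowel run."""
    run = longest_vowel_run(frame_probs, threshold)
    if run is None:
        return None
    start, end = run
    return (start * frame_shift_ms, end * frame_shift_ms, (end - start) * frame_shift_ms)


if __name__ == "__main__":
    # Hypothetical per-frame vowel probabilities for one CVC segment.
    probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.95, 0.9, 0.85, 0.7, 0.2, 0.1]
    print(vowel_duration_ms(probs))
```

Taking the longest run rather than the first frame above threshold keeps a short spurious detection inside a consonant from being reported as the vowel. A structured decoder that scores onset and offset jointly would be another option, but nothing in the abstract confirms either choice.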