Exploiting articulatory features for pitch accent detection

Articulatory features describe how the articulators are involved in producing sounds. Speakers often articulate accented phonemes in a more exaggerated way, so articulatory features can be useful for pitch accent detection. Instead of articulatory features obtained by directly measuring the articulators, we use the posterior probabilities produced by multi-layer perceptrons (MLPs) as articulatory features. The MLP inputs are frame-level acoustic features pre-processed with the split temporal context-2 (STC-2) approach, and the outputs are posterior probabilities over a set of articulatory attributes. These frame-level posteriors are averaged within each syllable to yield syllable-level articulatory features. To our knowledge, this is the first work to introduce articulatory features into pitch accent detection. Combining the articulatory features extracted in this way with traditional acoustic features improves pitch accent detection accuracy by about 2%.
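
A minimal sketch of the syllable-level pooling step described above, assuming frame-level posterior probabilities from the articulatory-attribute MLPs and syllable boundaries given in frames; the function and variable names are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def syllable_articulatory_features(frame_posteriors, syllable_spans):
    """Average frame-level articulatory-attribute posteriors within each syllable.

    frame_posteriors: (num_frames, num_attributes) array of MLP posterior
                      probabilities, one row per acoustic frame.
    syllable_spans:   list of (start_frame, end_frame) pairs, one per syllable
                      (end_frame is exclusive).
    Returns an array of shape (num_syllables, num_attributes) that serves as
    the syllable-level articulatory feature vector.
    """
    feats = []
    for start, end in syllable_spans:
        feats.append(frame_posteriors[start:end].mean(axis=0))
    return np.stack(feats)

# Toy example: 3 syllables over 30 frames, 8 articulatory attributes
posteriors = np.random.rand(30, 8)
posteriors /= posteriors.sum(axis=1, keepdims=True)   # each row sums to 1
spans = [(0, 10), (10, 22), (22, 30)]
syllable_feats = syllable_articulatory_features(posteriors, spans)
print(syllable_feats.shape)  # (3, 8)
```

The resulting syllable-level vectors can then be concatenated with conventional acoustic features (e.g. pitch, energy, duration) before classification.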
