Automated evaluation of non-native English pronunciation quality: combining knowledge- and data-driven features at multiple time scales

Automatic evaluation of the pronunciation quality of non-native speech has seen tremendous success in both research and commercial settings, with applications in second-language (L2) learning. In this paper, submitted for the INTERSPEECH 2015 Degree of Nativeness Sub-Challenge, the problem is posed in a challenging cross-corpora setting, using speech data from multiple speakers with a variety of native-language (L1) backgrounds reading different English sentences. Since the perception of non-nativeness is realized at both the segmental and suprasegmental linguistic levels, we explore a number of acoustic cues at multiple time scales. We experiment with both data-driven and knowledge-inspired features that capture degree of nativeness from pauses in speech, speaking rate, rhythm/stress, and goodness of phone pronunciation. One promising finding is that highly accurate automated assessment can be attained using a small, diverse set of intuitive and interpretable features. Performance is further boosted by smoothing scores across utterances from the same speaker; our best system significantly outperforms the challenge baseline.
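The abstract does not specify how scores are smoothed across utterances from the same speaker. As a minimal sketch of one plausible scheme, the following Python function blends each utterance-level score with the mean score of its speaker. The function name `smooth_by_speaker`, the blending weight `alpha`, and the linear-interpolation rule itself are assumptions made for illustration, not the method reported in the paper.

```python
from collections import defaultdict


def smooth_by_speaker(scores, speakers, alpha=0.5):
    """Blend each utterance score with the mean score of its speaker.

    Hypothetical illustration: alpha and the interpolation rule are
    assumptions, not taken from the paper.
    alpha = 0 leaves the scores unchanged; alpha = 1 replaces every
    score with the mean score of its speaker.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(scores) != len(speakers):
        raise ValueError("scores and speakers must have the same length")

    # Accumulate per-speaker sums and counts to compute speaker means.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, speaker in zip(scores, speakers):
        totals[speaker] += score
        counts[speaker] += 1
    means = {spk: totals[spk] / counts[spk] for spk in totals}

    # Interpolate each utterance score toward its speaker mean.
    return [
        (1.0 - alpha) * score + alpha * means[speaker]
        for score, speaker in zip(scores, speakers)
    ]
```

For example, `smooth_by_speaker([1.0, 3.0, 5.0], ["a", "a", "b"], alpha=0.5)` pulls the two scores of speaker `a` toward their mean of 2.0 and leaves the single score of speaker `b` unchanged. The design rationale is that nativeness is largely a speaker-level trait, so pooling evidence across a speaker's utterances may reduce per-utterance prediction noise.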
