Human and Automated Scoring of Fluency, Pronunciation and Intonation During Human-Machine Spoken Dialog Interactions

We present a spoken dialog-based framework for the computer-assisted language learning (CALL) of conversational English. In particular, we leveraged the open-source HALEF dialog framework to develop a job-interview conversational application. We then used crowdsourcing to collect multiple interactions with the system from non-native English speakers. We analyzed human-rated scores of the recorded dialog data on three scoring dimensions critical to the delivery of conversational English – fluency, pronunciation, and intonation/stress – and further examined the efficacy of automatically extracted, hand-curated speech features in predicting each of these subscores. Machine learning experiments showed that trained scoring models generally perform on par with the human inter-rater agreement baseline in predicting human-rated scores of conversational proficiency.
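
As a rough, illustrative sketch of the comparison the abstract describes (and not the authors' actual pipeline), the Python snippet below trains a regression-based scoring model on placeholder speech features and measures its machine-human correlation against a human-human inter-rater baseline. The synthetic data, feature dimensionality, and the choice of a random forest regressor are assumptions made for illustration only.

```python
# Minimal sketch: compare machine-human agreement of a trained scoring
# model against the human-human inter-rater agreement baseline.
# All data below are synthetic placeholders, not the study's data.

import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical data: one row per scored dialog response.
# X: automatically extracted speech features (e.g., speaking rate,
#    pause statistics, acoustic likelihoods); y1, y2: two human raters.
X = rng.normal(size=(200, 20))                     # placeholder feature matrix
y1 = rng.integers(1, 5, size=200).astype(float)    # rater 1 subscores
y2 = y1 + rng.normal(scale=0.5, size=200)          # rater 2 (correlated)

# Human inter-rater agreement baseline.
human_r, _ = pearsonr(y1, y2)

# Machine-human agreement via cross-validated model predictions.
model = RandomForestRegressor(n_estimators=500, random_state=0)
pred = cross_val_predict(model, X, y1, cv=5)
machine_r, _ = pearsonr(pred, y1)

print(f"human-human r = {human_r:.2f}, machine-human r = {machine_r:.2f}")
```

In the setting described in the abstract, this comparison would be carried out separately for each subscore (fluency, pronunciation, and intonation/stress), with the features extracted from the recorded human-machine dialog audio.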
