Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system

It is not always clear how differences in intrinsic evaluation metrics for a parser or classifier will affect the performance of the overall system that uses it. We investigate the relationship between the intrinsic evaluation scores of an interpretation component in a tutorial dialogue system and learning outcomes in an experiment with human users. Following the PARADISE methodology, we use multiple linear regression to build predictive models of learning gain, an important objective outcome metric in tutorial dialogue. We show that standard intrinsic metrics such as F-score alone do not predict the outcomes well. However, we can build predictive performance functions that account for up to 50% of the variance in learning gain by combining features based on standard evaluation scores and on the confusion matrix entries. We argue that building such predictive models can help us better evaluate the performance of NLP components that cannot be distinguished on F-score alone, and we illustrate our approach by comparing the current interpretation component in the system to a new classifier trained on the evaluation data.
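As a rough illustration of the PARADISE-style modelling described above, the following minimal sketch regresses learning gain on intrinsic evaluation features of the interpretation component. It assumes scikit-learn is available; the feature set (per-session F-score plus two confusion-matrix-derived error rates) and all numbers are illustrative placeholders, not the features or data reported in the paper.

    # Hypothetical sketch: a PARADISE-style performance function predicting
    # learning gain from intrinsic evaluation features (illustrative data only).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # One row per dialogue session; columns are intrinsic metrics of the
    # interpretation component for that session (values are made up):
    # [F-score, rate of correct answers judged incorrect, rate of incorrect answers judged correct]
    X = np.array([
        [0.82, 0.05, 0.10],
        [0.75, 0.12, 0.08],
        [0.90, 0.03, 0.04],
        [0.68, 0.15, 0.14],
        [0.79, 0.07, 0.11],
    ])
    y = np.array([0.35, 0.22, 0.48, 0.15, 0.30])  # normalized learning gain per session

    model = LinearRegression().fit(X, y)
    print("coefficients:", model.coef_)
    print("R^2 (variance in learning gain explained):", r2_score(y, model.predict(X)))

In this setup, the fitted coefficients and R^2 play the role of the performance function: they indicate how much of the variance in the outcome metric the combined intrinsic features can account for, which is the quantity the abstract reports reaching roughly 50%.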
