Which System Differences Matter? Using L1/L2 Regularization to Compare Dialogue Systems

We investigate how to jointly explain the performance and behavioral differences of two spoken dialogue systems. Our method, Joint Evaluation and Differences Identification (JEDI), finds the differences between systems that are relevant to performance by formulating the problem as a multi-task feature selection question. JEDI also provides evidence for the usefulness of a recent method, l1/lp-regularized regression (Obozinski et al., 2007). We evaluate against manually annotated success criteria from real users interacting with five different spoken dialogue systems that provide bus schedule information.
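
To make the multi-task formulation concrete, the sketch below illustrates the l1/l2 (group-lasso) idea that JEDI builds on; it is not the authors' implementation, and the names (l1_l2_multitask, Xs, ys, lam) are illustrative assumptions. Each system k contributes its own regression problem, mapping per-dialogue features X_k to annotated success y_k, and the penalty groups every feature's coefficients across all systems, so a feature is either selected jointly for all systems or zeroed out entirely.

import numpy as np

def l1_l2_multitask(Xs, ys, lam, lr=None, n_iter=1000):
    """l1/l2-regularized least squares across K tasks (systems),
    solved by proximal gradient descent (a hypothetical sketch).

    Xs  -- list of K design matrices (n_k x d), one per system
    ys  -- list of K target vectors (n_k,), e.g. dialogue success
    lam -- regularization strength; larger -> fewer shared features
    Returns W (d x K): row j holds feature j's coefficients across
    systems; an all-zero row means the feature was dropped jointly.
    """
    K, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((d, K))
    if lr is None:
        # Step size from the largest per-task Lipschitz constant
        lr = 1.0 / max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    for _ in range(n_iter):
        # Gradient step on each system's mean squared-error loss
        G = np.column_stack([X.T @ (X @ W[:, k] - y) / X.shape[0]
                             for k, (X, y) in enumerate(zip(Xs, ys))])
        V = W - lr * G
        # Proximal step: group soft-thresholding of each feature's
        # coefficient row across systems (the l1/l2 penalty)
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        W = np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12)) * V
    return W

# Toy usage: three "systems" over a shared 10-dimensional feature space,
# where only the first three features actually drive success.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(50, 10)) for _ in range(3)]
w_true = np.zeros(10)
w_true[:3] = 1.0
ys = [X @ w_true + 0.1 * rng.normal(size=50) for X in Xs]
W = l1_l2_multitask(Xs, ys, lam=0.1)
print(np.flatnonzero(np.linalg.norm(W, axis=1) > 1e-6))  # jointly selected features

Rows of W that survive the shrinkage point to candidate system differences that matter for success; in practice lam would be tuned, e.g. by cross-validated prediction of the success annotations.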

References

[1] Maxine Eskénazi et al. Doing research on a deployed spoken dialogue system: one year of Let's Go! experience. INTERSPEECH, 2006.

[2] Victor Zue et al. Experiments in Evaluating Interactive Spoken Language Systems. HLT, 1992.

[3] M. Yuan et al. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 2006.

[4] Mee Young Park et al. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B, 2007.

[5] Jeremy H. Wright et al. Automatically Training a Problematic Dialogue Predictor for a Spoken Dialogue System. Journal of Artificial Intelligence Research, 2011.

[6] Tim Paek. Empirical Methods for Evaluating Dialog Systems. SIGDIAL Workshop, 2001.

[7] Helen F. Hastie et al. Automatic Evaluation: Using a DATE Dialogue Act Tagger for User Satisfaction and Task Completion Prediction. LREC, 2002.

[8] David Suendermann-Oeft et al. Is it possible to predict task completion in automated troubleshooters? INTERSPEECH, 2010.

[9] Eric P. Xing et al. Multi-population GWA mapping via multi-task regularized regression. Bioinformatics, 2010.

[10] Jack Mostow et al. Classifying dialogue in high-dimensional space. ACM Transactions on Speech and Language Processing, 2011.

[11] Antinus Nijholt et al. Formal Semantics and Pragmatics of Dialogue. 1998.

[12] Marilyn A. Walker et al. Towards developing general models of usability with PARADISE. Natural Language Engineering, 2000.

[13] Sebastian Möller et al. Predicting the quality and usability of spoken dialogue services. Speech Communication, 2008.

[14] H. Zou et al. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 2005.

[15] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B, 1996.

[16] Stephen J. Wright et al. Simultaneous Variable Selection. Technometrics, 2005.

[17] Alan W. Black et al. Describing Spoken Dialogue Systems Differences. 2008.

[18] Hua Ai et al. Comparing Spoken Dialog Corpora Collected with Recruited Subjects versus Real Users. SIGDIAL, 2007.

[19] Mark W. Schmidt et al. Structure learning in random fields for heart motion abnormality detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[20] Morena Danieli et al. Metrics for Evaluating Dialogue Strategies in a Spoken Language System. arXiv, 1996.

[21] R. Pieraccini et al. "How am I Doing?": A New Framework to Effectively Measure the Performance of Automated Customer Care Contact Centers. 2010.

[22] Johan Schalkwyk et al. Deploying GOOG-411: Early lessons in data, measurement, and testing. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.

[23] Roberto Pieraccini et al. Value-Based Optimal Decision for Dialog Systems. IEEE Spoken Language Technology Workshop (SLT), 2006.

[24] Massimiliano Pontil et al. Taking Advantage of Sparsity in Multi-Task Learning. COLT, 2009.

[25] Melita Hajdinjak et al. The PARADISE Evaluation Framework: Issues and Findings. Computational Linguistics, 2006.

[26] Sebastian Möller et al. Evaluating spoken dialogue systems according to de-facto standards: A case study. Computer Speech & Language, 2007.

[27] Alexander I. Rudnicky et al. Olympus: an open-source framework for conversational spoken language interface research. HLT-NAACL, 2007.