Using Vision and Speech Features for Automated Prediction of Performance Metrics in Multimodal Dialogs

Predicting and analyzing multimodal dialog user experience (UX) metrics, such as overall call experience, caller engagement, and latency, on an ongoing basis is important for evaluating such systems. We investigate automated prediction of several such metrics collected from crowdsourced interactions with an open-source, cloud-based multimodal dialog system in the educational domain. We extract features from both the audio and video signals and examine the efficacy of multiple machine learning algorithms in predicting these performance metrics. The best-performing audio features consist of low-level audio descriptors (intensity, loudness, cepstra, pitch, and so on) and their functionals, extracted using the OpenSMILE toolkit, while the video features are bag-of-visual-words representations built on 3D Scale-Invariant Feature Transform (SIFT) descriptors. We find that our proposed methods outperform the majority-vote classification baseline in predicting various UX metrics rated by both users and experts. Our results suggest that such automated prediction of performance metrics can not only inform qualitative and quantitative analyses of dialogs but also potentially be incorporated into dialog management routines to positively impact UX and other metrics during the course of an interaction.
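To make the audio pipeline concrete, the sketch below shows how low-level descriptors and their functionals of the kind named above can be extracted with the openSMILE Python wrapper. The ComParE 2016 feature set and the input file name are illustrative assumptions, not necessarily the exact configuration used in this work.

```python
# Minimal sketch of functional-level audio feature extraction with the
# opensmile Python wrapper (pip install opensmile). Feature set and file
# path are assumptions for illustration.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,      # LLDs + functionals
    feature_level=opensmile.FeatureLevel.Functionals,   # statistics over LLDs
)

# Returns a one-row pandas DataFrame of ComParE functionals computed over
# descriptors such as intensity, loudness, MFCCs, and pitch.
features = smile.process_file("caller_audio.wav")  # placeholder path
print(features.shape)
```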
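On the video side, the bag-of-visual-words step can be sketched as follows: local spatio-temporal descriptors (e.g., 3D SIFT vectors, assumed here to be extracted elsewhere as one (n_i, d) array per video) are clustered into a codebook, and each video becomes a normalized histogram of codeword assignments. The vocabulary size, function names, and use of scikit-learn's MiniBatchKMeans are illustrative choices, not the paper's exact setup.

```python
# Sketch of a bag-of-visual-words encoder over precomputed 3D SIFT
# descriptors; the vocabulary size is an assumption.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_sets, n_words=500, seed=0):
    """Cluster all training descriptors (a list of (n_i, d) arrays)
    into a visual vocabulary of n_words codewords."""
    all_descriptors = np.vstack(descriptor_sets)
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    """Encode one video as an L1-normalized histogram of codeword counts."""
    words = codebook.predict(descriptors)
    counts = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return counts / max(counts.sum(), 1.0)
```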
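Finally, the comparison against the majority-vote baseline can be outlined with scikit-learn: a DummyClassifier with the most_frequent strategy stands in for the baseline, and each learned model is scored against it under cross-validation. The feature matrix X, the labels y, and the particular classifiers below are placeholders rather than the paper's reported configuration.

```python
# Hedged sketch of baseline-vs-model comparison; classifier choices and
# cross-validation settings are illustrative.
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def compare_to_baseline(X, y, cv=5):
    """Print cross-validated accuracy for the baseline and two learned models."""
    models = {
        "majority-vote baseline": DummyClassifier(strategy="most_frequent"),
        "linear SVM": SVC(kernel="linear"),
        "random forest": RandomForestClassifier(n_estimators=200),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv)
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```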

[1] Lei Chen, et al. Applying Rhythm Features to Automatically Assess Non-Native Speech, 2011, INTERSPEECH.

[2] Gina-Anne Levow, et al. Predicting User Satisfaction in Spoken Dialog System Evaluation With Collaborative Filtering, 2012, IEEE Journal of Selected Topics in Signal Processing.

[3] Xiaoming Xi, et al. Automatic scoring of non-native spontaneous speech in tests of spoken English, 2009, Speech Commun.

[4] Florian Eyben, et al. openSMILE: the Munich open Speech and Music Interpretation by Large-Space Extraction toolkit, 2010.

[5] Helen F. Hastie, et al. A survey on metrics for the evaluation of user simulations, 2012, The Knowledge Engineering Review.

[6] David Suendermann-Oeft, et al. Multimodal HALEF: An Open-Source Modular Web-Based Multimodal Dialog Framework, 2016, IWSDS.

[7] Marilyn A. Walker, et al. PARADISE: A Framework for Evaluating Spoken Dialogue Agents, 1997, ACL.

[8] Xiaoming Xi, et al. Towards Using Structural Events To Assess Non-native Speech, 2010.

[9] Xiaoming Xi, et al. Improved pronunciation features for construct-driven assessment of non-native spontaneous speech, 2009, HLT-NAACL.

[10] Morena Danieli, et al. Metrics for Evaluating Dialogue Strategies in a Spoken Language System, 1996, arXiv.

[11] Jean Ponce, et al. Computer Vision: A Modern Approach, 2002.

[12] James R. Glass, et al. Collecting Voices from the Cloud, 2010, LREC.

[13] R. Pieraccini, et al. “How am I Doing?”: A New Framework to Effectively Measure the Performance of Automated Customer Care Contact Centers, 2010.

[14] Andrea Vedaldi, et al. VLFeat: an open and portable library of computer vision algorithms, 2010, ACM Multimedia.

[15] Sebastian Möller, Quality of Telephone-Based Spoken Dialogue Systems, 2005.

[16] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[17] Diane J. Litman, et al. Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor, 2011, Speech Commun.

[18] David Suendermann-Oeft, et al. Assembling the Jigsaw: How Multiple Open Standards Are Synergistically Combined in the HALEF Multimodal Dialog System, 2017.

[19] David Suendermann-Oeft, et al. Caller Experience: A method for evaluating dialog systems and its automatic prediction, 2008, IEEE Spoken Language Technology Workshop.

[20] Ian Frank, et al. For a fistful of dollars: using crowd-sourcing to evaluate a spoken language CALL application, 2011, SLaTE.

[21] Imed Zitouni, et al. Automatic Online Evaluation of Intelligent Assistants, 2015, WWW.

[22] Gabriela Csurka, et al. Visual categorization with bags of keypoints, 2004, ECCV Workshop on Statistical Learning in Computer Vision.

[23] Mubarak Shah, et al. A 3-Dimensional SIFT Descriptor and its Application to Action Recognition, 2007, ACM Multimedia.

[24] David Suendermann-Oeft, et al. HALEF: An Open-Source Standard-Compliant Telephony-Based Modular Spoken Dialog System: A Review and An Outlook, 2015, IWSDS.

[25] Björn W. Schuller, et al. Recent developments in openSMILE, the Munich open-source multimedia feature extractor, 2013, ACM Multimedia.

[26] Su-Youn Yoon, et al. Acoustic Feature-based Non-scorable Response Detection for an Automated Speaking Proficiency Assessment, 2012, INTERSPEECH.

[27] Kalina Bontcheva, et al. Human Language Technologies, 2009, Semantic Knowledge Management.

[28] Wolfgang Minker, et al. Modeling and Predicting Quality in Spoken Human-Computer Interaction, 2011, SIGDIAL Conference.

[29] Xiaoming Xi, et al. A three-stage approach to the automated scoring of spontaneous spoken responses, 2011, Comput. Speech Lang.

[30] Diane J. Litman, et al. Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources, 2004, NAACL.

[31] Rémi Ronfard, et al. A survey of vision-based methods for action representation, segmentation and recognition, 2011, Comput. Vis. Image Underst.

[32] Jeremy H. Wright, et al. Using Natural Language Processing and Discourse Features to Identify Understanding Errors in a Spoken Dialogue System, 2000.

[33] Silke M. Witt, Use of speech recognition in computer-assisted language learning, 2000.

[34] Su-Youn Yoon, et al. Application of Structural Events Detected on ASR Outputs for Automated Speaking Assessment, 2012, INTERSPEECH.

[35] Milica Gasic, et al. Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk, 2011, INTERSPEECH.