Will this dialogue be unsuccessful? Prediction using audio features

This paper proposes a method to improve statistical spoken dialogue systems; specifically, it aims to enable early detection of unsuccessful dialogues from the audio stream. If an interaction is predicted to be unsuccessful, this information could be used to update the dialogue policy or to forward the call to a human agent. A dataset of interactions between Amazon Mechanical Turk workers and a statistical spoken dialogue system is used, comprising 702 recorded dialogues. Mel-frequency cepstral coefficients (MFCCs) are extracted from the user’s speech signal and stacked into a “feature image”, which is then given as input to a convolutional neural network of 9 layers. The reported accuracy is 94.7%, and the system correctly predicts that a dialogue will be unsuccessful in 97.9% of cases. With respect to accuracy, this is an improvement of 17.2% over our previous work on predicting dialogue quality. We observe that, for this task, convolutional neural networks can model temporal correlations given context information, and that the cepstral domain is a useful and compact input representation for them.
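To make the pipeline concrete (MFCC extraction, stacking frames into a fixed-size “feature image”, and binary classification with a CNN), the following is a minimal sketch using librosa and PyTorch. The paper does not specify its exact layer configuration or feature parameters, so `n_mfcc`, `max_frames`, and all layer sizes below are illustrative assumptions, not the authors’ settings.

```python
import numpy as np
import librosa
import torch.nn as nn

def mfcc_feature_image(wav_path, n_mfcc=13, max_frames=400):
    """Extract MFCCs from a user recording and pad/truncate them into a
    fixed-size 2-D "feature image" (coefficients x time frames).
    n_mfcc and max_frames are illustrative values, not the paper's."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    if mfcc.shape[1] < max_frames:
        # Zero-pad along the time axis so every dialogue yields the same width.
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc.astype(np.float32)

class DialogueSuccessCNN(nn.Module):
    """Small 2-D CNN over the MFCC image. The paper only states a depth of
    9 layers, so the filter counts and kernel sizes here are placeholders."""
    def __init__(self, n_mfcc=13, max_frames=400):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 64 * (n_mfcc // 8) * (max_frames // 8)  # valid for the defaults above
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 64), nn.ReLU(),
            nn.Linear(64, 2),  # two classes: successful vs. unsuccessful
        )

    def forward(self, x):  # x: (batch, 1, n_mfcc, max_frames)
        return self.classifier(self.features(x))
```

In this sketch, a dialogue-level prediction would be obtained by running the recorded user audio through `mfcc_feature_image` and feeding the resulting image (with a channel dimension added) to the network.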
