Performance Comparison of Machine Learning Models Trained on Manual vs ASR Transcriptions for Dialogue Act Annotation

Automatic dialogue act annotation of speech utterances is an important task in human-agent interaction, as it allows user utterances to be interpreted correctly. Speech utterances can be transcribed manually or via an Automatic Speech Recognizer (ASR). In this article, several Machine Learning models are trained on manual and ASR transcriptions of user utterances, using bag-of-words and n-gram feature generation approaches, and evaluated on an ASR-transcribed test set. Results show that models trained on ASR transcriptions outperform models trained on manual transcriptions. The impact of the irregular distribution of dialogue acts on the accuracy of statistical models is also investigated, and a partial solution to this issue is presented using multimodal information as input.
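The two feature generation approaches named above can be sketched in a few lines. This is a minimal, generic illustration of bag-of-words and word n-gram counting, not the paper's actual pipeline; the function names and the whitespace tokenization are assumptions for the sketch.

```python
from collections import Counter

def bag_of_words(utterance):
    """Unigram counts: each token is a feature; word order is discarded."""
    return Counter(utterance.lower().split())

def word_ngrams(utterance, n=2):
    """Word n-grams retain some local ordering lost by bag-of-words."""
    tokens = utterance.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Example utterance a dialogue act classifier might see (hypothetical).
bow = bag_of_words("could you repeat that please")
bigrams = word_ngrams("could you repeat that please", n=2)
```

In practice such count dictionaries are vectorized over a fixed vocabulary (and often TF-IDF weighted) before being fed to a classifier; ASR-transcribed training data exposes the model to the same recognition errors it will face at test time, which is consistent with the result reported above.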