An empirical investigation of sparse log-linear models for improved dialogue act classification

Previous work on dialogue act classification has primarily focused on dense generative and discriminative models. However, because automatic speech recognition (ASR) output is often noisy, dense models may produce biased estimates and overfit the training data. In this paper, we study sparse modeling approaches to improve dialogue act classification, because sparse models maintain a compact feature space that is robust to noise. To test this, we investigate various element-wise frequentist shrinkage models such as lasso, ridge, and elastic net, as well as structured sparsity models and a hierarchical sparsity model that embed the dependency structure and interaction among local features. In our experiments on a real-world dataset, when augmenting N-best word- and phone-level ASR hypotheses with confusion network features, our best sparse log-linear model obtains a relative improvement of 19.7% over a rule-based baseline and a significant 3.7% improvement over a traditional non-sparse log-linear model, and it outperforms a state-of-the-art SVM model by 2.2%.
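
For orientation, the element-wise shrinkage models named above can be written as a single penalized objective. The following is a minimal sketch in the standard textbook form, not the paper's own formulation: the symbols \theta (weights), f(x, y) (feature vector), N (number of training utterances), and \lambda_1, \lambda_2 (penalty strengths) are assumed notation introduced here for illustration.

```latex
% Log-linear model over dialogue act labels y given an utterance x
% (assumed notation; f(x, y) is a feature vector, \theta the weights)
p_\theta(y \mid x) =
  \frac{\exp\big(\theta^\top f(x, y)\big)}
       {\sum_{y'} \exp\big(\theta^\top f(x, y')\big)}

% Penalized negative log-likelihood over N training pairs (x_i, y_i)
\hat{\theta} = \arg\min_{\theta}
  \; -\sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)
  \; + \lambda_1 \lVert \theta \rVert_1
  \; + \lambda_2 \lVert \theta \rVert_2^2

% Special cases:
%   lasso:        \lambda_1 > 0, \lambda_2 = 0
%   ridge:        \lambda_1 = 0, \lambda_2 > 0
%   elastic net:  \lambda_1 > 0, \lambda_2 > 0
```

In this form, the \ell_1 term is what drives individual weights exactly to zero and so yields a compact feature space; the ridge penalty alone shrinks weights toward zero without removing features.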
