Unsupervised Modeling of Twitter Conversations

We propose the first unsupervised approach to the problem of modeling dialogue acts in an open domain. Trained on a corpus of noisy Twitter conversations, our method discovers dialogue acts by clustering raw utterances. Because it accounts for the sequential behaviour of these acts, the learned model can provide insight into the shape of communication in a new medium. We address the challenge of evaluating the emergent model with a qualitative visualization and an intrinsic conversation ordering task. This work is inspired by a corpus of 1.3 million Twitter conversations, which will be made publicly available. This huge amount of data, available only because Twitter blurs the line between chatting and publishing, highlights the need to be able to adapt quickly to a new medium.
