Topic Segmentation and Labeling in Asynchronous Conversations

Topic segmentation and labeling is often considered a prerequisite for higher-level conversation analysis and has been shown to be useful in many Natural Language Processing (NLP) applications. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the segmentation and labeling tasks in these asynchronous conversations. We propose a complete computational framework for topic segmentation and labeling in asynchronous conversations. Our approach extends state-of-the-art methods by considering the fine-grained structure of an asynchronous conversation, along with other conversational features, through the application of recent graph-based methods for NLP. For topic segmentation, we propose two novel unsupervised models that exploit the fine-grained conversational structure, and a novel graph-theoretic supervised model that combines lexical, conversational, and topic features. For topic labeling, we propose two novel unsupervised random walk models that capture conversation-specific clues from two different sources, respectively: the leading sentences and the fine-grained conversational structure. Empirical evaluation shows that the segmentation and labeling performed by our best models outperform the state of the art and are highly correlated with human annotations.
