Identifying Conversational Message Threads by Integrating Classification and Data Clustering

Conversational message thread identification is relevant to a wide spectrum of applications, ranging from social network marketing to virus propagation and digital forensics. Many approaches to identifying conversational threads have been proposed in the literature, but they focus on features that are strongly dependent on the dataset. In this paper, we introduce a novel method that identifies threads in any type of conversational text, overcoming the limitation of having to determine specific features for each dataset in advance. Given a pool of messages, our method extracts the semantic content, the social interactions, and the timestamp of each message and maps them into a three-dimensional representation; it then clusters the messages into conversational threads. We extend our previous work by introducing a deep learning approach and by performing new, extensive experiments and comparisons with classical learning algorithms.
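
To make the pipeline above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each pair of messages is described by three similarity scores (semantic content, social interactions, timestamp) and that messages are then grouped into threads. The specific measures (bag-of-words cosine, Jaccard overlap of participants, an inverse time gap), the plain average used to combine them, the single-link threshold grouping, and all names and parameter values are illustrative assumptions; a learned combination and a different clustering algorithm could be substituted.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical message record; field names are illustrative.
    text: str
    participants: frozenset  # sender and recipients
    timestamp: float         # hours since an arbitrary origin


def text_similarity(a: Message, b: Message) -> float:
    """Cosine similarity between bag-of-words term counts."""
    va, vb = Counter(a.text.lower().split()), Counter(b.text.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def social_similarity(a: Message, b: Message) -> float:
    """Jaccard overlap of the people involved in the two messages."""
    union = a.participants | b.participants
    return len(a.participants & b.participants) / len(union) if union else 0.0


def time_similarity(a: Message, b: Message, scale_hours: float = 24.0) -> float:
    """Approaches 1 for messages close in time, decays as the gap grows."""
    return 1.0 / (1.0 + abs(a.timestamp - b.timestamp) / scale_hours)


def similarity_vector(a: Message, b: Message) -> tuple:
    """Three-dimensional representation of a message pair."""
    return (text_similarity(a, b), social_similarity(a, b), time_similarity(a, b))


def cluster_threads(messages, threshold: float = 0.5):
    """Single-link grouping: merge any pair whose mean score reaches the threshold.

    The plain average is an assumption standing in for a learned combination.
    Returns one thread label per message, numbered in order of first appearance.
    """
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            score = sum(similarity_vector(messages[i], messages[j])) / 3.0
            if score >= threshold:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(messages))]


if __name__ == "__main__":
    pool = [
        Message("budget meeting tomorrow", frozenset({"alice", "bob"}), 0.0),
        Message("re budget meeting agenda", frozenset({"alice", "bob"}), 1.0),
        Message("lunch pizza friday", frozenset({"carol", "dave"}), 100.0),
        Message("pizza lunch sounds great", frozenset({"carol", "dave"}), 101.0),
    ]
    print(cluster_threads(pool))  # [0, 0, 1, 1]
```

In this toy pool, the first two messages share vocabulary, participants, and a close timestamp, so they fall into one thread; the last two form a second thread for the same reasons.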
