Email thread identification using latent Dirichlet allocation and non-negative matrix factorization based clustering techniques

Email is one of the most popular and effective ways of communicating over the internet, and many email applications are available for computers and mobile devices. As email use grows, so does the number of messages each user sends and receives, which makes it hard to remember earlier emails and to relate incoming messages to previous communication on the same topic. Email threads address this problem: a thread gives the user the sequence of emails belonging to one conversation within a time frame. This work presents two email thread identification algorithms based on a nested text clustering approach. The approach has two stages. In the first stage, one of two popular text clustering techniques, latent Dirichlet allocation (LDA) or non-negative matrix factorization (NMF), is applied to the email messages to form email clusters. In the second stage, clustering is performed again over the resulting email clusters, using threading features, to identify the email threads. The presented algorithms are evaluated in terms of accuracy, precision, recall and F-measure.
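The two-stage idea can be illustrated with a minimal sketch of the NMF variant. It is not the authors' implementation. It assumes bag-of-words counts, Lee-Seung multiplicative updates for the factorization, and, because the abstract does not list the threading features, a normalized subject line as the only second-stage feature. All function names (`term_matrix`, `nmf`, `normalize_subject`, `identify_threads`) and the sample emails are hypothetical.

```python
import random
import re
from collections import defaultdict


def term_matrix(texts):
    """Build a bag-of-words count matrix (documents x vocabulary)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for t in texts:
        row = [0.0] * len(vocab)
        for w in t.lower().split():
            row[index[w]] += 1.0
        rows.append(row)
    return rows


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate V by W H (both non-negative) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def normalize_subject(subject):
    """Assumed threading feature: subject with reply/forward prefixes removed."""
    return re.sub(r"^((re|fwd?)\s*:\s*)+", "", subject.strip(), flags=re.I).lower()


def identify_threads(emails, k=2):
    """Stage 1: topic cluster per email via NMF. Stage 2: group by subject within a cluster."""
    texts = [e["subject"] + " " + e["body"] for e in emails]
    w, _ = nmf(term_matrix(texts), k)
    threads = defaultdict(list)
    for i, (email, row) in enumerate(zip(emails, w)):
        cluster = max(range(k), key=lambda c: row[c])  # strongest topic weight
        threads[(cluster, normalize_subject(email["subject"]))].append(i)
    return sorted(threads.values())


# Hypothetical sample data, for illustration only.
emails = [
    {"subject": "Budget review", "body": "please review the quarterly budget numbers"},
    {"subject": "Re: Budget review", "body": "the budget numbers look fine"},
    {"subject": "Team lunch", "body": "lunch on friday at the new cafe"},
    {"subject": "RE: Team lunch", "body": "friday lunch works for me"},
]
threads = identify_threads(emails, k=2)
```

Each returned thread is a list of email indices. An LDA variant would differ only in stage 1, replacing the factorization with per-document topic proportions, while stage 2 would stay the same. A realistic second stage would likely combine several threading features rather than the subject alone.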
