A Novel Approach for Email Clustering Based on Semantics

Clustering short documents has recently attracted increasing interest. Short documents do not contain enough text for similarities to be computed accurately with the most widely used technique, the Vector Space Model (VSM). Adding semantics to short-document clustering is one effective way to address this problem. However, real-life collections often mix very short and very long documents. For example, the lengths of each user's email messages follow a power-law distribution, so both long and short emails appear in an email corpus. As a result, neither state-of-the-art short-document clustering approaches nor long-document clustering approaches achieve high cluster quality or high efficiency when a collection contains both short and long documents. To solve this problem, we propose a novel semantics-based approach to email clustering. Empirical validation shows that our method achieves high cluster quality and high efficiency on real-world email datasets.
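The short-document problem can be illustrated with a minimal sketch. It is not the method proposed here: the tokenizer, the sample messages, and the `TOY_CONCEPTS` mapping are all hypothetical and chosen only to show why plain VSM cosine similarity can fail on short texts and how mapping words to shared concepts may help.

```python
import math
from collections import Counter

# Hypothetical word-to-concept map, invented for illustration only.
# A real system would draw on a semantic resource instead.
TOY_CONCEPTS = {
    "appointment": "meeting",
    "moved": "reschedule",
    "rescheduled": "reschedule",
}

def tf_vector(text, concepts=None):
    """Term-frequency vector; optionally replace words by their concept."""
    tokens = text.lower().split()
    if concepts is not None:
        tokens = [concepts.get(t, t) for t in tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two short messages about the same event that share no surface terms.
short_a = "meeting moved to friday"
short_b = "appointment rescheduled for tomorrow"

# Plain VSM: no overlapping terms, so the similarity is 0.0.
sim_plain = cosine(tf_vector(short_a), tf_vector(short_b))

# Concept-mapped VSM: two shared concepts out of four terms each, giving 0.5.
sim_concept = cosine(
    tf_vector(short_a, TOY_CONCEPTS),
    tf_vector(short_b, TOY_CONCEPTS),
)

print(sim_plain, sim_concept)
```

With only four words per message, a single differing word choice removes all overlap, which is why term-based similarity is unreliable for short emails while longer messages tend to share enough vocabulary for VSM to work.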
