Paper Information - The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research

Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
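The abstract mentions combining the per-section classifiers (From, To, Subject, body) using regression weights but does not say how those weights are obtained. The sketch below shows one plausible reading: each section's classifier produces a score for a given folder, and a ridge-regularized least-squares fit learns one weight per section. The function names, the ridge penalty, and the toy scores and labels are all illustrative assumptions; none of them come from the paper.

```python
from typing import List

# Section order used for every score vector below.
SECTIONS = ["from", "to", "subject", "body"]


def fit_ridge_weights(X: List[List[float]], y: List[float],
                      ridge: float = 0.01) -> List[float]:
    """Solve (X^T X + ridge * I) w = X^T y by Gaussian elimination."""
    d = len(X[0])
    # Augmented normal-equation matrix [A | b].
    A = [
        [sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
         for j in range(d)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(d)
    ]
    # Forward elimination; A is symmetric positive definite, so pivots are > 0.
    for col in range(d):
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (A[i][d] - tail) / A[i][i]
    return w


def combined_score(w: List[float], section_scores: List[float]) -> float:
    """Weighted sum of the per-section classifier scores."""
    return sum(wi * si for wi, si in zip(w, section_scores))


# Made-up per-section scores for one folder (rows: messages, columns: SECTIONS).
scores = [
    [1.0, 0.2, 0.8, 0.5],
    [0.9, -0.1, 0.4, 0.7],
    [-0.8, 0.1, -0.5, -0.6],
    [-1.0, -0.3, -0.7, -0.2],
    [0.7, 0.0, 0.6, 0.3],
    [-0.6, 0.2, -0.9, -0.4],
]
# +1 if the message belongs to the folder, -1 otherwise (also made up).
labels = [1.0, 1.0, -1.0, -1.0, 1.0, -1.0]

weights = fit_ridge_weights(scores, labels)
predictions = [combined_score(weights, x) > 0 for x in scores]
```

In a real setting the score vectors would come from classifiers trained separately on each section, and the weights would be fit on held-out messages rather than on the data used to train those classifiers.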

Yiming Yang | Bryan Klimt

[1] William W. Cohen. Learning Rules that Classify E-Mail, 1996.

[2] David D. Lewis, et al. Threading Electronic Mail - A Preliminary Study, 1997, Inf. Process. Manag.

[3] Jeffrey O. Kephart, et al. MailCat: an intelligent assistant for organizing e-mail, 1999, AGENTS '99.

[4] H. Murakoshi, et al. Construction of Deliberation Structure in Email Communication, 1999, Pacific Association for Computational Linguistics.

[5] Hongjun Lu, et al. A Comparative Study of Classification Based Personal E-mail Filtering, 2000, PAKDD.

[6] Christopher Meek, et al. Challenges of the Email Domain for Text Classification, 2000, ICML.

[7] Jason D. M. Rennie. ifile: An Application of Machine Learning to E-Mail Filtering, 2000.

[8] Yiming Yang, et al. A study of thresholding strategies for text categorization, 2001, SIGIR '01.

[9] J. Kay, et al. Automatic Induction of Rules of e-mail Classification, 2001.

[10] Stan Matwin, et al. Email classification with co-training, 2011, CASCON.

[11] Elio Masciari, et al. Towards An Adaptive Mail Classifier, 2002.