k-NN Aggregation with a Stacked Email Representation

The variety of email-related tasks, together with the growth in daily email load, has created a need for automated email management tools. In this paper, we provide an empirical evaluation of representational schemes and retrieval strategies for email. In particular, we study the impact of both textual and non-textual email content on case representation for email task management. Our first contribution is Stack, an email representation based on stacking. Multiple casebases are created, each using a different case representation built from attributes of the semi-structured email content. A k-NN classifier is applied to each casebase, and the classifier outputs are combined to form a new case representation. Our second contribution is a new evaluation method that creates random, chronological, stratified train-test trials, which respect both the temporal ordering and the class distribution of the data; both aspects are crucial in the email domain. The Enron corpus was used to create a dataset for the email deletion prediction task. Evaluation results show significant improvements with Stack over retrieval from a single casebase and over retrieval from multiple casebases combined by majority vote.
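
The stacking idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two views (`text_view`, `header_view`), the toy feature values, the class labels, the Euclidean distance, the use of vote shares as the k-NN output, and the leave-one-out construction of the stacked training cases are all assumptions made for the example.

```python
from collections import Counter
from math import dist

CLASSES = ["delete", "keep"]  # hypothetical labels for the deletion task


def knn_votes(cases, labels, query, k, skip=None):
    """Vote share per class among the k nearest cases (Euclidean distance)."""
    ranked = sorted(
        (dist(case, query), i) for i, case in enumerate(cases) if i != skip
    )
    nearest = [labels[i] for _, i in ranked[:k]]
    counts = Counter(nearest)
    return [counts[c] / len(nearest) for c in CLASSES]


def stack_case(casebases, labels, query_views, k, skip=None):
    """Concatenate the vote vectors from every casebase into one stacked case."""
    stacked = []
    for cases, query in zip(casebases, query_views):
        stacked.extend(knn_votes(cases, labels, query, k, skip))
    return stacked


def predict(casebases, labels, query_views, k):
    """Classify a query with a k-NN over the stacked representation."""
    # Leave-one-out (an assumption): each training case is stacked
    # without being allowed to retrieve itself as a neighbour.
    meta_cases = [
        stack_case(casebases, labels, [cb[i] for cb in casebases], k, skip=i)
        for i in range(len(labels))
    ]
    meta_query = stack_case(casebases, labels, query_views, k)
    votes = knn_votes(meta_cases, labels, meta_query, k)
    return CLASSES[votes.index(max(votes))]


# Toy data: one casebase per (hypothetical) view of the same six emails.
text_view = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
             [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
header_view = [[1.0], [0.9], [0.8], [0.1], [0.2], [0.0]]
labels = ["delete"] * 3 + ["keep"] * 3
casebases = [text_view, header_view]

print(predict(casebases, labels, [[0.88, 0.12], [0.95]], k=3))
```

Each casebase contributes one vote vector per email, so the stacked case has one block of class scores per view. A majority-vote baseline would instead take the top class from each view and count them, discarding the per-view vote strengths that the stacked representation keeps.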
