Improving KNN-based e-mail classification into folders by generating class-balanced datasets.

In this paper we deal with an e-mail classification problem known as e-mail foldering, which consists of classifying incoming mail into the different folders previously created by the user. This task has received less attention in the literature than spam filtering and is quite complex due to the (usually large) cardinality of the class variable (number of folders) and its lack of balance (documents per class). On the other hand, proximity-based algorithms have been used in a wide range of fields for decades. One of the main drawbacks of these classifiers, known as lazy classifiers, is their computational load, since they must compute the distance from a new sample to every point in the vector space in order to decide which class it belongs to. For this reason, most of the techniques developed for these classifiers rely on editing and condensing the training set. In this work we approach the problem of e-mail classification into folders and propose a new neighbourhood-based algorithm, called Gaussian Balanced K-NN, which neither edits nor condenses the database; instead, it samples an entirely new training set from the marginal Gaussian distributions of the initial set. This algorithm allows the computational load of the classifier to be chosen and also balances the training set, alleviating the same problems that editing and condensation techniques try to solve.

Keywords: Balanced, class, distance, classification.
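The following is a minimal sketch of the sampling idea outlined above, assuming the marginal Gaussians are estimated independently for each feature of each class. The class name GaussianBalancedKNN, the samples_per_class parameter, and the use of scikit-learn's KNeighborsClassifier are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


class GaussianBalancedKNN:
    """Illustrative sketch: draw a class-balanced synthetic training set from
    the per-feature (marginal) Gaussian distributions of each class, then fit
    a standard K-NN on that synthetic set (hypothetical implementation)."""

    def __init__(self, samples_per_class=100, k=5):
        # samples_per_class fixes the size of the synthetic set and therefore
        # the computational load of the resulting lazy classifier.
        self.samples_per_class = samples_per_class
        self.k = k

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        synth_X, synth_y = [], []
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)              # per-feature mean of this class
            sigma = Xc.std(axis=0) + 1e-9     # per-feature std (avoid zeros)
            # Sample the same number of points for every class, so the
            # synthetic training set is balanced by construction.
            samples = np.random.normal(
                mu, sigma, size=(self.samples_per_class, X.shape[1]))
            synth_X.append(samples)
            synth_y.append(np.full(self.samples_per_class, c))
        self._knn = KNeighborsClassifier(n_neighbors=self.k)
        self._knn.fit(np.vstack(synth_X), np.concatenate(synth_y))
        return self

    def predict(self, X):
        return self._knn.predict(np.asarray(X, dtype=float))
```

Under these assumptions, the original (possibly imbalanced) folder data is used only to estimate the class-wise means and standard deviations; the K-NN itself never stores the original messages, which is what lets the method trade accuracy for a chosen computational load.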
