Exploiting Concept Clumping for Efficient Incremental E-Mail Categorization

We introduce a novel approach to incremental e-mail categorization based on identifying and exploiting "clumps" of messages that are classified similarly. Clumping reflects the local coherence of a classification scheme and is particularly important in settings such as e-mail categorization, where the classification scheme changes dynamically. We propose several metrics to quantify the degree of clumping in a sequence of messages. We then present several fast, incremental categorization methods and compare their performance against the clumping measured in each dataset to show how the methods exploit clumping. The methods are tested on seven large real-world e-mail datasets, one for each of seven users from the Enron corpus, in which each message is classified into exactly one folder. We show that our methods perform well, achieving accuracy comparable to several common machine learning algorithms with much greater computational efficiency.
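To make the idea of clumping concrete, the following minimal sketch computes one very simple stand-in measure: the fraction of messages filed into the same folder as the message immediately before them. This is an illustrative assumption, not one of the metrics proposed in the paper; the function name `repeat_ratio` and the example folder names are hypothetical.

```python
from typing import Hashable, Sequence


def repeat_ratio(labels: Sequence[Hashable]) -> float:
    """Fraction of messages whose folder equals that of the preceding message.

    Hypothetical illustration of label clumping; not the paper's metric.
    """
    if len(labels) < 2:
        # With fewer than two messages there are no adjacent pairs to compare.
        return 0.0
    # Count adjacent pairs that share the same folder label.
    repeats = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return repeats / (len(labels) - 1)


# Folder labels in arrival order (made-up example data).
folders = ["work", "work", "work", "family", "family", "work"]
print(repeat_ratio(folders))
```

A value near 1 would indicate long runs of identically classified messages, while a value near 0 would indicate that consecutive messages rarely share a folder.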
