AUTOMATIC HIERARCHICAL E-MAIL CLASSIFICATION USING ASSOCIATION RULES

The explosive growth of online communication, in particular e-mail, makes it necessary to organize information for faster and easier processing and searching. Storing e-mail messages in hierarchically organized folders, where each folder corresponds to a separate topic, has proven very useful. Previous approaches to this problem use Naïve Bayes or TF-IDF-style classifiers that rest on the unrealistic assumption of term independence. These methods are also context-insensitive, in that the meaning of a word is taken to be independent of the presence or absence of other words in the same message. It has been shown that text classification methods that relax the independence assumption and capture context achieve higher accuracy. In this thesis, we address the problem of term dependence by building an associative classifier called Classification using Cohesion and Multiple Association Rules, or COMAR for short. The problem of capturing context is addressed by looking for phrases in message corpora. Both rules and phrases are generated using an efficient FP-growth-like approach. Since the number of rules and phrases produced can be very large, we propose two new measures, rule cohesion and phrase cohesion, which possess the anti-monotone property and thereby allow rule and phrase pruning to be pushed deep into the generation process. This approach to pattern pruning proves much more efficient than "generate-and-prune" methods. Both unstructured text attributes and semi-structured non-text attributes, such as senders and recipients, are used for classification. The COMAR classification algorithm uses multiple rules to predict the several highest-probability topics for each message. Different feature selection and rule ranking methods are compared.
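The benefit of an anti-monotone measure is that a pattern failing the threshold can be discarded together with all of its supersets while patterns are still being grown, rather than after a full "generate-and-prune" pass. The toy level-wise miner below illustrates this idea using plain support (the classic anti-monotone measure) as a stand-in for the thesis's rule and phrase cohesion measures; all data, names, and thresholds are illustrative, and the actual COMAR miner is FP-growth-like rather than Apriori-like.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def mine_frequent(transactions, min_sup):
    """Level-wise mining with pruning pushed into generation."""
    items = sorted({i for t in transactions for i in t})
    # Level 1: keep only single items that pass the threshold.
    frontier = [frozenset([i]) for i in items
                if support(frozenset([i]), transactions) >= min_sup]
    frequent = list(frontier)
    k = 2
    while frontier:
        # Grow candidates only from surviving (k-1)-patterns: every
        # superset of a pruned pattern is skipped automatically, which
        # is exactly what the anti-monotone property licenses.
        candidates = {a | b for a, b in combinations(frontier, 2)
                      if len(a | b) == k}
        frontier = [c for c in candidates
                    if support(c, transactions) >= min_sup]
        frequent.extend(frontier)
        k += 1
    return frequent

# Four toy "messages" represented as sets of terms.
msgs = [frozenset(m.split()) for m in [
    "meeting agenda project",
    "meeting project deadline",
    "lunch friday",
    "project deadline budget",
]]
patterns = mine_frequent(msgs, min_sup=0.5)
```

On this toy corpus the miner keeps patterns such as {project, deadline} (support 0.5) but never even generates supersets of the pruned singleton {lunch}.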
Our studies show that the hierarchical associative classifier that utilizes phrases, multiple rules, and deep rule pruning, with biased confidence or rule cohesion for rule ranking, achieves higher accuracy and is more efficient than other associative classifiers, and is also more accurate than Naïve Bayes.
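The multiple-rule prediction step can be sketched as follows: every rule whose antecedent is contained in the message fires, per-topic scores accumulate the weights of the firing rules, and the k best-scoring topics are returned. This is a minimal illustration, not the thesis's algorithm; the rule weights here are plain confidences, whereas COMAR ranks rules by biased confidence or rule cohesion, and all rules and data below are made up.

```python
def rank_topics(message_terms, rules, k=2):
    """rules: list of (antecedent_terms, topic, weight) triples."""
    scores = {}
    for antecedent, topic, weight in rules:
        if antecedent <= message_terms:  # the rule fires on this message
            scores[topic] = scores.get(topic, 0.0) + weight
    # Highest combined score first; return the k best candidate topics.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical mined rules: antecedent terms -> folder topic, with weight.
rules = [
    (frozenset({"patch", "kernel"}), "linux",   0.9),
    (frozenset({"kernel"}),          "linux",   0.6),
    (frozenset({"invoice"}),         "finance", 0.8),
    (frozenset({"patch"}),           "dev",     0.5),
]
msg = frozenset({"patch", "kernel", "review"})
top = rank_topics(msg, rules)
```

Combining several firing rules per topic, rather than trusting the single best rule, is what lets the classifier suggest a ranked shortlist of plausible folders for each message.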
