Rule Generation and Rule Selection Techniques for Cost-Sensitive Associative Classification

Classification aims to assign a data object to its appropriate class, a task traditionally performed by a model built from the training dataset, such as a decision tree. Associative classification is a novel strategy for performing this task in which the model is composed of a particular set of association rules, where the consequent of each rule (i.e., its right-hand side) is restricted to the class attribute. Rule generation and rule selection are two major issues in associative classification. Rule generation aims to find a set of association rules that best describes the entire dataset, while rule selection aims to select, for a particular case, the best rule among all the rules generated. Rule generation and rule selection techniques dramatically affect the effectiveness of the classifier. In this paper we propose new techniques for rule generation and rule selection. In our proposed approach, rules are generated based on the concept of maximal frequent class itemsets (increasing the size of the rule pattern) and then selected based on their informative value and on the cost that an error implies (possibly reducing misclassifications). We validate our techniques on two important real-world problems: spam detection and protein homology detection. Further, we compare our techniques against existing ones, ranging from the well-known naive Bayes classifier to domain-specific classifiers. Experimental results show that our techniques achieve a significant improvement of 30% in classification effectiveness.
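The abstract does not spell out the scoring used in the rule selection step. The Python sketch below is only an illustration of how a cost-sensitive selection among matching rules might look, assuming each rule carries a confidence estimate and each class a misclassification cost; the Rule class, the select_rule function, and the cost-weighted score are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of cost-sensitive rule selection for associative
# classification. Assumptions (not from the paper): each rule stores an
# antecedent itemset, a predicted class, and a confidence estimate, and the
# score simply discounts confidence by the cost of a wrong prediction.
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: frozenset    # itemset on the rule's left-hand side
    predicted_class: str     # class attribute on the right-hand side
    confidence: float        # estimate of P(class | antecedent)


def select_rule(instance, rules, cost):
    """Return the covering rule with the best cost-weighted score.

    `instance` is a set of items, `rules` a list of Rule objects, and
    `cost[c]` the penalty incurred when class c is predicted incorrectly.
    """
    best_rule, best_score = None, float("-inf")
    for rule in rules:
        if not rule.antecedent <= instance:
            continue  # rule does not cover this instance
        # Penalize rules whose errors are expensive: subtract the expected
        # misclassification cost from the rule's confidence.
        score = rule.confidence - (1.0 - rule.confidence) * cost[rule.predicted_class]
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule  # None if no generated rule covers the instance


# Example usage with made-up rules for a spam-detection setting, where
# misclassifying legitimate mail is assumed to be the costlier error.
rules = [
    Rule(frozenset({"free", "winner"}), "spam", 0.90),
    Rule(frozenset({"meeting"}), "ham", 0.80),
]
cost = {"spam": 5.0, "ham": 1.0}
print(select_rule({"free", "winner", "offer"}, rules, cost))
```

A more faithful implementation would also incorporate the informative value of each rule and derive the candidate rules from maximal frequent class itemsets, as the paper proposes; the sketch only isolates the cost-sensitive selection idea.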
