A clustering rule-based approach to predictive modeling

Recent work on rule-based classifiers and on clustering data before learning has improved classification accuracy in predictive modeling tasks. This research introduces an approach that combines the two techniques and studies its effect on predictive accuracy. The algorithm presented here, the Clustering Rule-based Algorithm (CRA), first clusters the original training set using an Expectation Maximization (EM) algorithm. A separate Classification and Regression Tree (CART) is then trained on each cluster. To obtain an upper bound on accuracy, each test instance is evaluated against every rule produced by every tree to determine whether at least one rule from any tree classifies it correctly. Under this upper-bound evaluation, a predictive accuracy of 100% was achievable. The approach draws on the advantages of both supervised and unsupervised learning to produce a more powerful and more accurate predictive model.
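The upper-bound evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the per-cluster trees have already been trained and flattened into rule lists, and every name in it (`Rule`, `oracle_correct`, `upper_bound_accuracy`, `tree_a`, `tree_b`, the thresholds and the toy data) is hypothetical. A test instance counts as correct if any rule from any tree fires on it and predicts its true label.

```python
from typing import Callable, Hashable, Sequence

Instance = Sequence[float]
Label = Hashable
# Assumed representation: a rule is (condition on the instance, predicted label).
Rule = tuple[Callable[[Instance], bool], Label]


def oracle_correct(instance: Instance, true_label: Label,
                   rule_sets: Sequence[Sequence[Rule]]) -> bool:
    """Return True if some rule from some tree fires and predicts the true label."""
    return any(
        condition(instance) and label == true_label
        for rules in rule_sets
        for condition, label in rules
    )


def upper_bound_accuracy(X: Sequence[Instance], y: Sequence[Label],
                         rule_sets: Sequence[Sequence[Rule]]) -> float:
    """Fraction of instances that at least one rule classifies correctly."""
    hits = sum(oracle_correct(x, t, rule_sets) for x, t in zip(X, y))
    return hits / len(X)


# Hypothetical rule sets standing in for the leaves of two per-cluster trees.
tree_a = [(lambda x: x[0] <= 2.0, "low"), (lambda x: x[0] > 2.0, "high")]
tree_b = [(lambda x: x[1] <= 0.5, "high"), (lambda x: x[1] > 0.5, "low")]

# Toy test set, invented for illustration only.
X_test = [(1.0, 0.3), (3.0, 0.2), (1.5, 0.1), (4.0, 0.8)]
y_test = ["low", "high", "high", "low"]

print(upper_bound_accuracy(X_test, y_test, [tree_a, tree_b]))
```

On this toy data neither rule set is perfect on its own, but every instance is covered by a correct rule from at least one of them, so the combined upper bound is higher than either single-tree score. The label is consulted during evaluation, so the figure is an optimistic ceiling and not the accuracy of a deployable classifier.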
