J-PMCRI: A Methodology for Inducing Pre-pruned Modular Classification Rules

Inducing rules from very large datasets is one of the most challenging areas in data mining. Several approaches exist for scaling up classification rule induction to large datasets, namely data reduction and the parallelisation of classification rule induction algorithms. In the area of parallelisation, most of the work has concentrated on the Top Down Induction of Decision Trees (TDIDT), also known as the ‘divide and conquer’ approach. However, powerful alternative algorithms exist that induce modular rules. Most of these alternatives follow the ‘separate and conquer’ approach to inducing rules, but very little work has been done to make that approach scale better on large training data. This paper examines the potential of the recently developed blackboard-based J-PMCRI methodology for parallelising modular classification rule induction algorithms that follow the ‘separate and conquer’ approach. A concrete implementation of the methodology is evaluated empirically on very large datasets.
