Parallel Rule Induction with Information Theoretic Pre-Pruning

In a world where data is captured on a large scale, the major challenge for data mining algorithms is scaling up to large datasets. There are two main approaches to inducing classification rules: the divide and conquer approach, also known as top down induction of decision trees, and the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach, but very little on scaling up the separate and conquer approach. In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation lets the Prism family of algorithms harness the additional computing resources of a network of computers, so that the induction of classification rules scales better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
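
For readers unfamiliar with separate and conquer, the following is a minimal serial sketch of a Prism-style covering loop as it is commonly described for the basic algorithm. It is not the parallel framework of this work and it includes no pre-pruning. The function names (`matches`, `induce_rule`, `prism`), the dictionary row format with a `"class"` key, and the tie-breaking by coverage are all assumptions made for illustration.

```python
def matches(rule, row):
    """True if the row satisfies every attribute=value term of the rule."""
    return all(row[a] == v for a, v in rule.items())


def induce_rule(rows, target, attrs):
    """Conquer step: greedily add the term that maximises p(target | term)."""
    rule = {}
    subset = rows
    # Specialise until the covered subset is pure or no attributes are left.
    while any(r["class"] != target for r in subset) and len(rule) < len(attrs):
        best = None  # (score, attribute, value)
        for a in attrs:
            if a in rule:
                continue
            for v in sorted({r[a] for r in subset}):
                covered = [r for r in subset if r[a] == v]
                pos = sum(r["class"] == target for r in covered)
                # Assumed tie-break: prefer the term covering more target rows.
                score = (pos / len(covered), pos)
                if best is None or score > best[0]:
                    best = (score, a, v)
        _, a, v = best
        rule[a] = v
        subset = [r for r in subset if r[a] == v]
    return rule


def prism(rows, attrs):
    """Separate step: remove covered rows, repeat until the class is covered."""
    rules = []
    for target in sorted({r["class"] for r in rows}):
        remaining = list(rows)  # each class starts from the full training set
        while any(r["class"] == target for r in remaining):
            rule = induce_rule(remaining, target, attrs)
            rules.append((rule, target))
            remaining = [r for r in remaining if not matches(rule, r)]
    return rules


if __name__ == "__main__":
    # Toy data, invented purely for this example.
    data = [
        {"outlook": "sunny", "windy": "no", "class": "play"},
        {"outlook": "sunny", "windy": "yes", "class": "stay"},
        {"outlook": "rain", "windy": "no", "class": "stay"},
        {"outlook": "rain", "windy": "yes", "class": "stay"},
    ]
    for rule, label in prism(data, ["outlook", "windy"]):
        terms = " AND ".join(f"{a}={v}" for a, v in rule.items())
        print(f"IF {terms} THEN class={label}")
```

The inner search over every attribute-value pair is repeated for each rule term, which suggests why this family of algorithms becomes expensive on large datasets and why distributing that work is attractive.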
