A Scalable Classification Algorithm for Very Large Datasets

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively refines and sharpens it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600,000 records for a seven-class classification problem) resulted in more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
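
The abstract does not give the algorithm's pseudocode, but the chunk-wise refinement loop it describes can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation: `induce_model` and `refine_model` are hypothetical placeholders for the paper's inductive-learning and refinement steps, and the initial-subset fraction and chunk size are arbitrary assumptions.

```python
def train_ira(X, y, induce_model, refine_model,
              initial_fraction=0.1, chunk_size=100_000):
    """Sketch of the chunk-wise refinement loop described in the abstract.

    `induce_model(X, y)` is assumed to build the initial domain knowledge
    from a data subset; `refine_model(model, X, y)` is assumed to update
    that knowledge with one additional chunk. Both are hypothetical
    placeholders, not the paper's actual procedures.
    """
    n = len(X)
    n_init = max(1, int(n * initial_fraction))

    # Step 1: build the initial domain knowledge from a subset of the data.
    model = induce_model(X[:n_init], y[:n_init])

    # Step 2: iteratively refine it using chunks of the remaining data.
    for start in range(n_init, n, chunk_size):
        end = min(start + chunk_size, n)
        model = refine_model(model, X[start:end], y[start:end])

    return model
```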
