A high-speed decision tree classifier algorithm for huge datasets

Knowledge discovery is an important tool for intelligent businesses to transform data into useful information that can increase revenue. Data mining techniques support automatic exploration of data: they attempt to classify the patterns and trends in the data and to infer decision rules from those patterns. Classification, a supervised machine learning procedure, is an important data mining function. The scalability and efficiency of a classifier algorithm become major concerns when the dataset is large and the algorithm requires many passes over it. In this paper, we present a scalable decision tree algorithm that classifies large datasets at high processing speed and requires only one scan over the dataset. It overcomes a drawback of the RainForest algorithm, which addresses the scalability issue but requires a pass over the dataset at each level of decision tree construction. The proposed algorithm significantly reduces I/O cost and sorts numerical attributes only once, which leads to better execution time. According to the experimental results, our algorithm takes less execution time than the RainForest algorithm and can be adapted to any attribute selection method, by which the accuracy of the decision tree can be improved.
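
As an illustration of the one-time sorting idea for numerical attributes, the following Python sketch sorts an attribute once and then evaluates every candidate threshold in a single sweep, updating class counts incrementally. It is not the algorithm proposed in this paper, whose details are not given in the abstract; the Gini index as the selection measure and the names `gini` and `best_numeric_split` are assumptions made for the example.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class-count mapping covering `total` records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split.

    Assumption for this sketch: the attribute is sorted once, and class
    counts on each side of the split are updated incrementally, so no
    record is revisited after the sort.
    """
    pairs = sorted(zip(values, labels))           # the only sort
    n = len(pairs)
    left = Counter()                              # class counts left of the split
    right = Counter(label for _, label in pairs)  # class counts right of the split
    best_threshold, best_score = None, float("inf")

    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue                              # no threshold between equal values
        n_left, n_right = i + 1, n - (i + 1)
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best_score:
            best_threshold = (value + next_value) / 2
            best_score = score
    return best_threshold, best_score


if __name__ == "__main__":
    print(best_numeric_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
```

Because the split measure is computed from the running counts alone, a different attribute selection method (for example, information gain) could be substituted by replacing `gini` without changing the sweep.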
