Decision Tree Algorithm based on Sampling

As databases grow, data mining algorithms face more demanding requirements for efficiency and accuracy. Mining large data sets requires large amounts of time and physical resources, and sampling is an effective way to reduce that cost. This paper proposes a new sampling-based decision tree algorithm for large data sets. It learns from a small initial sample whose distribution is similar to that of the original data set, and it stops sampling according to time complexity requirements and convergence criteria. Compared with the existing flexible decision tree algorithm, the proposed algorithm reduces computation time and I/O complexity while maintaining the accuracy of the tree.
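
The sampling loop described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not say how the initial sample is chosen or how convergence is measured, so the sketch assumes uniform random sampling (a prefix of a shuffled copy of the data), a geometric growth schedule, and a plateau in holdout accuracy as the convergence test. A one-split tree (stump) on a single numeric feature with 0/1 labels stands in for a full decision tree learner. All names and parameters (fit_stump, progressive_sample_tree, n0, growth, eps, time_budget_s) are hypothetical.

```python
import random
import time


def fit_stump(rows):
    """Fit a one-split tree on (x, label) rows with labels in {0, 1}.

    Stand-in for a full decision tree learner (an assumption of this sketch).
    """
    rows = sorted(rows)
    n = len(rows)
    total_pos = sum(y for _, y in rows)
    best_err, best = n + 1, None
    left_pos = 0
    for i, (x, y) in enumerate(rows, start=1):
        left_pos += y
        if i < n and rows[i][0] == x:
            continue  # only split between distinct feature values
        left_n, right_n = i, n - i
        right_pos = total_pos - left_pos
        # Each side predicts its majority label.
        left_label = 1 if 2 * left_pos > left_n else 0
        right_label = 1 if 2 * right_pos > right_n else 0
        err = (left_n - left_pos if left_label else left_pos) + (
            right_n - right_pos if right_label else right_pos
        )
        if err < best_err:
            best_err, best = err, (x, left_label, right_label)
    threshold, left_label, right_label = best
    return lambda v: left_label if v <= threshold else right_label


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def progressive_sample_tree(data, holdout, n0=100, growth=2, eps=0.01,
                            time_budget_s=5.0, seed=0):
    """Grow the training sample until accuracy plateaus or time runs out.

    Assumed stopping rules: holdout accuracy changes by less than eps
    between consecutive samples, the time budget is exceeded, or the
    whole data set has been used.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)  # any prefix is now a uniform random sample
    start = time.monotonic()
    n, prev_acc, history = n0, None, []
    while True:
        n = min(n, len(shuffled))
        model = fit_stump(shuffled[:n])
        acc = accuracy(model, holdout)
        history.append((n, acc))
        converged = prev_acc is not None and abs(acc - prev_acc) < eps
        out_of_time = time.monotonic() - start > time_budget_s
        if converged or out_of_time or n == len(shuffled):
            return model, history
        prev_acc, n = acc, n * growth
```

Calling progressive_sample_tree(train_rows, holdout_rows) returns the last fitted model together with the list of (sample size, holdout accuracy) pairs it visited, which shows where the learning curve flattened. Because each sample is a prefix of the same shuffled list, larger samples reuse the rows already read, which is one simple way to limit repeated I/O under these assumptions.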
