Paper information - A New Sampling Strategy for Building Decision Trees from Large Databases

A New Sampling Strategy for Building Decision Trees from Large Databases

We propose a fast and efficient sampling strategy for building decision trees from a very large database, even when there are many numerical attributes that must be discretized at each step. Successive samples are used, one at each tree node. Applying the method to a simulated database of virtually infinite size confirms that when the database is large and contains many numerical attributes, fast sampling at each node (with a sample size of about n = 300 or 500) speeds up the mining process while maintaining the accuracy of the classifier.

Ricco Rakotomalala | Jean-Hugues Chauchat
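
The per-node sampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how attributes are discretized or when growth stops, so the sketch swaps in a plain entropy-based binary threshold split and a simple gain/depth stopping rule. The names `entropy`, `best_split`, `build_tree`, `predict`, `min_gain`, and `max_depth` are hypothetical, and an in-memory list stands in for the database.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_split(sample):
    """Return (gain, attribute index, threshold) of the best binary split.

    Assumption: a binary threshold chosen by information gain stands in
    for the discretization step, which the abstract does not specify.
    """
    base = entropy([y for _, y in sample])
    best = (0.0, None, None)
    for j in range(len(sample[0][0])):
        values = sorted({x[j] for x, _ in sample})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in sample if x[j] <= t]
            right = [y for x, y in sample if x[j] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(sample)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, j, t)
    return best


def build_tree(records, n=300, min_gain=1e-3, max_depth=8, rng=random):
    """Grow a tree in which every node picks its split from at most n records."""
    # Only a small random sample is examined to choose the split at this node.
    sample = records if len(records) <= n else rng.sample(records, n)
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    gain, j, t = best_split(sample)
    # Hypothetical stopping rule: no useful split found, or depth exhausted.
    if j is None or gain < min_gain or max_depth == 0:
        return {"leaf": majority}
    # In-memory stand-in for routing database rows to the two child nodes.
    left = [r for r in records if r[0][j] <= t]
    right = [r for r in records if r[0][j] > t]
    return {"attr": j, "threshold": t,
            "left": build_tree(left, n, min_gain, max_depth - 1, rng),
            "right": build_tree(right, n, min_gain, max_depth - 1, rng)}


def predict(tree, x):
    """Follow the thresholds from the root down to a leaf label."""
    while "leaf" not in tree:
        branch = "left" if x[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Usage on simulated data: the class depends only on the first attribute.
demo_rng = random.Random(0)
demo_data = []
for _ in range(5000):
    point = (demo_rng.random(), demo_rng.random())
    demo_data.append((point, int(point[0] > 0.5)))
demo_tree = build_tree(demo_data, n=300, rng=demo_rng)
demo_accuracy = sum(predict(demo_tree, x) == y for x, y in demo_data) / len(demo_data)
print(f"training accuracy with n=300 per node: {demo_accuracy:.3f}")
```

In this sketch the cost of choosing a split depends on the sample size `n` rather than on the number of records reaching the node, which is the source of the speed-up the abstract refers to. Only the final routing of records to child nodes touches the full data.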
