Building fast decision trees from large training sets

Decision trees are commonly used in supervised classification. Supervised classification problems with large training sets are now very common, yet many supervised classifiers cannot handle this amount of data. Some decision tree induction algorithms are capable of processing large training sets, but almost all of them have memory restrictions because they need to keep the whole training set, or a large part of it, in main memory. Moreover, algorithms that do not have memory restrictions either have to choose a subset of the training set, which requires extra time for the selection, or require the user to specify values for parameters that can be very difficult to determine. In this paper, we present a new fast heuristic for building decision trees from large training sets. It overcomes some of the restrictions of the state-of-the-art algorithms by using all the instances of the training set without storing all of them in main memory. Experimental results show that our algorithm is faster than the most recent algorithms for building decision trees from large training sets.
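
To illustrate the general idea of using every training instance without keeping the training set in main memory, the following is a minimal sketch. It is not the heuristic presented in this paper, whose details are not given in the abstract. It assumes categorical attributes, a non-empty stream, and information gain as the split measure; the function names `entropy` and `best_split_streaming` are hypothetical. Each instance is read once and discarded, and only per-attribute class counts are retained.

```python
import math
from collections import Counter, defaultdict


def entropy(class_counts):
    """Shannon entropy (in bits) of a Counter mapping label -> count."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n)


def best_split_streaming(instances):
    """Pick the attribute with the highest information gain in one pass.

    `instances` is any iterable of (attribute_values, label) pairs, so it
    may be a generator reading from disk. Assumption: categorical
    attributes and at least one instance in the stream.
    """
    counts = defaultdict(Counter)   # (attribute index, value) -> label counts
    class_totals = Counter()        # label -> count over the whole stream
    for values, label in instances:
        class_totals[label] += 1
        for i, v in enumerate(values):
            counts[(i, v)][label] += 1
        # The instance itself is not stored anywhere after this point.

    n = sum(class_totals.values())
    base = entropy(class_totals)
    gains = defaultdict(lambda: base)
    for (i, _value), label_counts in counts.items():
        # Subtract the weighted entropy of each partition of attribute i.
        gains[i] -= sum(label_counts.values()) / n * entropy(label_counts)

    best = max(gains, key=gains.get)
    return best, gains[best]


if __name__ == "__main__":
    # Hypothetical toy data: attribute 1 determines the label, attribute 0 does not.
    data = [(("a", "x"), 0), (("a", "y"), 1), (("b", "x"), 0), (("b", "y"), 1)]
    print(best_split_streaming(iter(data)))
```

Memory use in this sketch grows with the number of distinct (attribute, value, label) combinations rather than with the number of instances, which is the property the abstract's claim depends on.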
