Speeding up Very Fast Decision Tree with Low Computational Cost

Very Fast Decision Tree (VFDT) is one of the most widely used online decision tree induction algorithms, providing high classification accuracy with theoretical guarantees. In VFDT, the split-attempt operation is essential for leaf splitting. It is computation-intensive because it computes the heuristic measure of every attribute of a leaf. To reduce split-attempts, VFDT attempts to split only at fixed intervals (for example, every 200 examples). However, this mechanism introduces split-delay, because a split can happen only at these fixed intervals; this slows down the growth of VFDT and ultimately lowers accuracy. To address this problem, we first devise an online incremental algorithm that computes the heuristic measure of an attribute at a much lower computational cost. Using this algorithm, a carefully selected subset of attributes is then monitored to detect a potential split timing. A split-attempt is carried out once the timing is verified. Together, these steps significantly lower both computational cost and split-delay. Comprehensive experiments are conducted on multiple synthetic and real datasets. Compared with state-of-the-art algorithms, our method reduces split-attempts by about 5 to 10 times on average with much lower split-delay, making our algorithm both faster and more accurate.
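To illustrate the flavor of incrementally maintaining a split heuristic, the following is a minimal, hypothetical sketch (not the paper's actual algorithm) that keeps the Gini impurity of a leaf's class distribution up to date in O(1) per arriving example. Instead of recomputing the sum of squared class proportions from scratch, it maintains the running sum of squared class counts and adjusts it when a single count changes.

```python
from collections import defaultdict

class IncrementalGini:
    """Maintain the Gini impurity of a class distribution incrementally.

    Hypothetical sketch: rather than recomputing sum_k(p_k^2) over all
    classes on every update, keep the running sum of squared class counts
    and adjust it in O(1) when one class count changes.
    """

    def __init__(self):
        self.counts = defaultdict(int)  # class label -> count
        self.total = 0                  # total examples seen
        self.sum_sq = 0                 # sum of squared class counts

    def add(self, label):
        c = self.counts[label]
        # Incrementing one count c -> c+1 changes sum_sq by (c+1)^2 - c^2 = 2c + 1.
        self.sum_sq += 2 * c + 1
        self.counts[label] = c + 1
        self.total += 1

    def gini(self):
        if self.total == 0:
            return 0.0
        # Gini = 1 - sum_k (n_k / N)^2 = 1 - sum_sq / N^2
        return 1.0 - self.sum_sq / (self.total * self.total)
```

An analogous constant-time update applies to information gain (entropy), using the identity that adding one example changes only a single class's contribution to the impurity.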
