Multi-GPU approach to global induction of classification trees for large-scale data mining

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and the tests simultaneously, and in many situations this improves both the prediction accuracy and the size of the resulting classifiers. However, it is a population-based, iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient use of GPU memory and computing resources. The searches for the tree structure and the tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that the data size boundaries for evolutionary DT mining are fading.
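
The data-parallel decomposition described above can be illustrated with a minimal CPU-only sketch. It is not the authors' CUDA implementation: the tuple-based tree encoding, the function names (`route`, `chunk_counts`, `split`, `fitness`), and the use of plain training accuracy as the fitness value are all assumptions made for illustration. Each data chunk stands in for the portion of the training set held by one GPU. The per-chunk class counts are merged on the host, as a CPU might do after the devices return their partial results.

```python
from collections import Counter

# Hypothetical tree encoding (an assumption, not taken from the paper):
#   internal node: (attribute_index, threshold, left_subtree, right_subtree)
#   leaf:          ("leaf", leaf_id)

def route(tree, x):
    """Return the id of the leaf that instance x reaches."""
    node = tree
    while node[0] != "leaf":
        attr, threshold, left, right = node
        node = left if x[attr] <= threshold else right
    return node[1]

def chunk_counts(tree, X, y):
    """Per-device work: class counts in every leaf for one data chunk."""
    counts = {}
    for x, label in zip(X, y):
        counts.setdefault(route(tree, x), Counter())[label] += 1
    return counts

def split(X, y, n_parts):
    """Split the data into at most n_parts contiguous chunks."""
    size = -(-len(X) // n_parts)  # ceiling division
    return [(X[i:i + size], y[i:i + size]) for i in range(0, len(X), size)]

def fitness(tree, X, y, n_devices=4):
    """Merge per-chunk counts on the host and return training accuracy.

    Assumes a non-empty dataset. Accuracy is a stand-in for whatever
    fitness function the evolutionary search actually uses.
    """
    merged = {}
    for X_chunk, y_chunk in split(X, y, n_devices):
        # In a real multi-GPU setting each call would run on its own device.
        for leaf, c in chunk_counts(tree, X_chunk, y_chunk).items():
            merged.setdefault(leaf, Counter()).update(c)
    # Each leaf predicts its majority class, so it classifies
    # max(count) of its instances correctly.
    correct = sum(max(c.values()) for c in merged.values())
    return correct / len(y)
```

Because the per-leaf counts are additive, the merged result does not depend on how the data is divided. This is what makes the fitness evaluation a natural candidate for splitting across several devices, while the search over tree structures stays on the CPU.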
