On Distributed Fuzzy Decision Trees for Big Data

Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far for FDT learning, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme, shaped according to the MapReduce programming model, for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are then used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world, publicly available big datasets to evaluate the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability as the number of computing units varies; and 3) ability to efficiently accommodate an increasing dataset size. We show that the proposed scheme is suitable for managing big datasets even with modest commodity hardware. Finally, for comparative analysis, we have used the distributed decision tree learning algorithm implemented in the MLlib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system.
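The two quantities named above, a strong fuzzy partition and fuzzy information gain, can be illustrated with a minimal single-machine sketch. It is not the distributed implementation described in the paper: it assumes triangular fuzzy sets whose cores are given in advance, it takes the cardinality of a fuzzy set to be the sum of its membership degrees, and the function names (`triangular_memberships`, `fuzzy_entropy`, `fuzzy_information_gain`) are hypothetical.

```python
import math
from collections import defaultdict


def triangular_memberships(x, cores):
    """Membership degrees of x in a strong triangular partition.

    `cores` is a sorted list of at least two core points (an assumption
    of this sketch). The returned degrees always sum to 1, which is the
    defining property of a strong fuzzy partition.
    """
    mu = [0.0] * len(cores)
    if x <= cores[0]:
        mu[0] = 1.0
    elif x >= cores[-1]:
        mu[-1] = 1.0
    else:
        for j in range(len(cores) - 1):
            left, right = cores[j], cores[j + 1]
            if left <= x <= right:
                t = (x - left) / (right - left)
                mu[j] = 1.0 - t
                mu[j + 1] = t
                break
    return mu


def fuzzy_entropy(weights, labels):
    """Entropy of the class distribution, with each example weighted
    by its membership degree."""
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = defaultdict(float)
    for w, y in zip(weights, labels):
        mass[y] += w
    return -sum((m / total) * math.log2(m / total)
                for m in mass.values() if m > 0)


def fuzzy_information_gain(values, labels, cores):
    """Gain obtained by splitting a root node on one continuous
    attribute, using the fuzzy partition defined by `cores`."""
    n = len(values)
    parent = fuzzy_entropy([1.0] * n, labels)
    memberships = [triangular_memberships(x, cores) for x in values]
    weighted_children = 0.0
    for j in range(len(cores)):
        w = [mu[j] for mu in memberships]
        # Fuzzy cardinality of child j relative to the parent node.
        weighted_children += (sum(w) / n) * fuzzy_entropy(w, labels)
    return parent - weighted_children
```

In an FDT learner of this kind, the gain would be computed for every candidate attribute at a decision node and the attribute with the highest value selected. In a MapReduce setting, the per-class membership sums are the natural quantities to aggregate across data partitions, although how the paper organizes that computation is not stated in the abstract.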
