Evaluating associative classification algorithms for Big Data

Background: Associative classification, which combines two important and distinct fields (classification and association rule mining), aims at building accurate and interpretable classifiers by means of association rules. A major problem in this field is that existing proposals do not scale well to Big Data. The aim of this work is therefore to propose adaptations of well-known associative classification algorithms (CBA and CPAR) for different Big Data platforms (Spark and Flink).

Results: An experimental study was performed on 40 datasets (30 classical datasets and 10 Big Data datasets). The classical datasets were used to determine which algorithms perform better sequentially, while the Big Data datasets were used to prove the scalability of the Big Data proposals. Results were analyzed by means of non-parametric tests. CBA-Spark and CBA-Flink obtained interpretable classifiers but were more time-consuming than CPAR-Spark and CPAR-Flink. The study demonstrated that the proposals are able to run on Big Data (file sizes up to 200 GB). The analysis of different quality metrics revealed no statistically significant difference between the two approaches. Finally, three metrics (speed-up, scale-up and size-up) were also analyzed to demonstrate that the proposals scale well on Big Data.

Conclusions: The experimental study revealed that sequential algorithms cannot be used on large quantities of data, so approaches such as CBA-Spark, CBA-Flink, CPAR-Spark or CPAR-Flink are required. CBA proved to be very useful when the main goal is to obtain highly interpretable results; however, when the runtime has to be minimized, CPAR should be used. No statistically significant difference could be found between the two proposals in terms of quality of results, except for the interpretability of the final classifiers, where CBA was statistically better than CPAR.
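To make the classification-by-rules idea concrete, the following is a minimal, self-contained sketch of the CBA-style prediction step, not the authors' Spark/Flink implementations: class association rules are ordered by confidence (ties broken by support), and the first rule whose antecedent is contained in the instance assigns the class. All names (`Rule`, `predict`, the toy rules) are illustrative assumptions, and the real algorithms additionally prune rules during classifier building.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset  # items that must all appear in the instance
    klass: str             # class label the rule predicts
    confidence: float
    support: float

def cba_sort(rules):
    # CBA-style precedence: higher confidence first, ties broken by support.
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))

def predict(rules, instance, default="unknown"):
    # Fire the first (highest-precedence) rule covering the instance;
    # fall back to a default class when no rule matches.
    for rule in cba_sort(rules):
        if rule.antecedent <= instance:
            return rule.klass
    return default

rules = [
    Rule(frozenset({"a"}), "no", 0.7, 0.5),
    Rule(frozenset({"a", "b"}), "yes", 0.9, 0.2),
]
print(predict(rules, {"a", "b", "c"}))  # -> yes (rule {a, b} has top precedence)
print(predict(rules, {"a"}))            # -> no
print(predict(rules, {"z"}))            # -> unknown
```

Each rule here plays the role of one "if antecedent then class" pattern mined from data, which is what makes the resulting classifier interpretable: every prediction can be traced back to a single readable rule.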
