A Data Mining Software Package Including Data Preparation and Reduction: KEEL

KEEL is an open source Data Mining tool widely used in research and real-life applications. Most, if not all, of the algorithms described throughout the book are implemented and publicly available in this Data Mining platform. Since KEEL enables the user to create and run single or concatenated preprocessing techniques on the data, the software is carefully introduced in this section, guiding the reader through the steps needed to set up any data preparation that might be required. The experimental analyses carried out in this book were also created using KEEL, allowing the reader to quickly compare and adapt the results presented here. An extensive review of Data Mining software tools is presented in Sect. 10.1. Among them, we focus on the open source KEEL platform in Sect. 10.2, providing details of its main features and usage. For the practitioner's interest, the most commonly used data sources are introduced in Sect. 10.3, and the steps needed to integrate any new algorithm into KEEL in Sect. 10.4. Once the results have been obtained, the appropriate comparison guidelines are provided in Sect. 10.5. The most important aspects of the tool are summarized in Sect. 10.6.
