Learner excellence biased by data set selection: A case for data characterisation and artificial data sets

The excellence of a given learner is usually claimed through a performance comparison with other learners over a collection of data sets. Too often, researchers are unaware of the impact of their data selection on the results: their test beds are small, and the choice of data sets is not supported by any prior analysis of the data. Conclusions drawn from such test beds cannot be generalised, because particular data characteristics may favour certain learners without this bias being noticed. This work raises these issues and proposes characterising data sets with complexity measures, which can help both to guide experimental design and to explain the behaviour of learners.
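To make the proposal concrete, the sketch below computes one example of the kind of data complexity measure the abstract refers to, the maximum Fisher's discriminant ratio (F1), which summarises how well any single feature separates two classes. This is a minimal illustration rather than the paper's implementation: it assumes a binary, numeric data set, and the function name and the synthetic data in the usage snippet are purely illustrative.

```python
import numpy as np

def fisher_discriminant_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio (F1) over all features.

    For a two-class problem, F1 = max_i (mu_1i - mu_2i)^2 / (s_1i^2 + s_2i^2),
    where mu_ci and s_ci^2 are the per-class mean and variance of feature i.
    Larger values mean at least one feature separates the classes well.
    """
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch assumes a binary classification problem")
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    numerator = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    denominator = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(numerator / (denominator + 1e-12)))  # guard against zero variance

# Illustrative usage on synthetic data: two Gaussian classes with different overlap.
rng = np.random.default_rng(0)
X_easy = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(4.0, 1.0, (100, 5))])
X_hard = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(0.5, 1.0, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
print(fisher_discriminant_ratio_f1(X_easy, y))  # large F1: well-separated classes
print(fisher_discriminant_ratio_f1(X_hard, y))  # small F1: heavily overlapped classes
```

For instance, a test bed composed only of data sets with large F1 values would quietly favour learners that exploit single discriminative features, which is exactly the kind of unnoticed bias the abstract warns about.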
