Towards UCI+: A mindful repository design

Public repositories have contributed to the maturation of experimental methodology in machine learning. Publicly available data sets have allowed researchers to assess their learners empirically and, together with open source machine learning software, have favoured the emergence of comparative analyses of learner performance within a common framework. These studies have established standard procedures for evaluating machine learning techniques. However, current claims, such as the superiority of enhanced algorithms, are biased by unsupported assumptions made in common practice. In this paper, we inspect the early steps of the methodology, namely data set selection. In particular, we examine the use of the most popular data repository in machine learning, the UCI repository, and analyse the type, complexity, and use of its data sets. The study recommends the design of a mindful data repository, UCI+, which should include a set of properly characterised data sets forming a complete and representative sample of real-world problems, enriched with artificial benchmarks. The ultimate goal of UCI+ is to lay the foundations for a well-supported methodology of learner assessment.
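
To make the proposed characterisation concrete, the sketch below computes one of the geometrical complexity measures that such a repository could attach to each data set: Fisher's discriminant ratio (F1), as defined by Ho and Basu (2002). This is an illustrative sketch only, not tooling from the paper; the use of Python, scikit-learn's Iris loader, and the restriction to a two-class subset are assumptions made for the example.

import numpy as np
from sklearn.datasets import load_iris

def fisher_discriminant_ratio(X, y):
    """Maximum per-feature Fisher discriminant ratio (F1) for two classes.

    Higher values mean at least one feature separates the classes well,
    i.e. the problem is geometrically 'simpler'.
    """
    classes = np.unique(y)
    assert len(classes) == 2, "this F1 definition assumes a two-class problem"
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    numerator = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    denominator = X0.var(axis=0) + X1.var(axis=0)
    return float(np.max(numerator / denominator))

# Illustration on a classic UCI problem: the Iris data, reduced to the
# setosa-versus-versicolor subset, which is linearly separable and so
# should score a high F1.
X, y = load_iris(return_X_y=True)
mask = y < 2
print(f"F1 = {fisher_discriminant_ratio(X[mask], y[mask]):.2f}")

A repository such as UCI+ could publish measures of this kind alongside every data set, so that the sample of benchmarks chosen for a comparative study can be checked for coverage of the complexity space rather than picked by habit.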
