Complexity curve: a graphical measure of data complexity and classifier performance

We describe a method for assessing data set complexity based on estimating the underlying probability distribution and on the Hellinger distance. Unlike some popular measures, it focuses not on the shape of the decision boundary in a classification task but on the amount of available data relative to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. We use it to propose a new variant of the learning curve plot, called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set's complexity curve. It is a classifier performance measure that shows how well the information present in the data is utilised.
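As a rough illustration of the idea, the sketch below computes the Hellinger distance between discrete distributions, H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), and uses it to trace how far the empirical distribution of a random subset lies from that of the full data set as the subset grows. The abstract does not specify the estimation procedure, so this is a simplification under stated assumptions: a single discrete attribute, relative frequencies as the probability estimate, and the hypothetical helper names `hellinger`, `empirical` and `complexity_curve`.

```python
import math
import random
from collections import Counter


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2.0)


def empirical(values):
    """Relative-frequency estimate of a discrete distribution."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


def complexity_curve(values, sizes, repeats=20, seed=0):
    """Mean Hellinger distance between random subsets and the full sample.

    Assumption: one discrete attribute; this is an illustrative sketch,
    not necessarily the estimator used in the paper.
    """
    rng = random.Random(seed)
    full = empirical(values)
    curve = []
    for m in sizes:
        # Draw `repeats` subsets of size m without replacement.
        distances = [
            hellinger(empirical(rng.sample(values, m)), full)
            for _ in range(repeats)
        ]
        curve.append((m, sum(distances) / repeats))
    return curve


if __name__ == "__main__":
    data = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
    for size, distance in complexity_curve(data, [5, 10, 25, 50, 100]):
        print(size, round(distance, 3))
```

Plotting the mean distance against the subset size gives a curve of the kind the abstract describes. Under these assumptions, a curve that drops to near zero for small subsets suggests the data set contains more data than is needed to capture the attribute's distribution.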
