Prototype-Based Classification in Unbalanced Biomedical Problems

Medical diagnosis can readily be framed as a classification problem aimed at identifying whether a disease is present or absent. Since a pathology is often much rarer than the healthy condition, medical diagnosis may require a classifier to cope with under-represented classes. Class imbalance, which has proved rather common in many other application domains, violates the traditional assumption of machine learning methods that the target classes have similar prior probabilities. In this respect, because of their unrestricted generalization ability, classifiers such as decision trees and naive Bayes are not well suited to the task. By contrast, case-based classifiers reason on representative samples of each class, which makes them appear a more suitable choice. In this chapter, the behavior of a case-based classifier, ProtoClass, on unbalanced biomedical classification problems is evaluated under different settings of the case-base configuration. Comparison with other classification methods showed the effectiveness of this approach on unbalanced classification problems and, hence, on medical diagnostic classification.
