Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction

Finding and removing misclassified instances is an important step in data mining and machine learning, because such instances degrade the performance of the learning algorithm. In this paper, we propose a C-Support Vector Classification Filter (C-SVCF) that identifies and removes misclassified instances (outliers) in breast cancer survivability samples collected from Srinagarind Hospital in Thailand, with the goal of improving the accuracy of the prediction models. Only instances that the filter classifies correctly are passed to the learning algorithm. We measure the performance of the proposed technique by accuracy and by the area under the receiver operating characteristic curve (AUC), and compare it with several popular ensemble filter approaches: AdaBoost, Bagging, and ensembles of SVM with AdaBoost and Bagging filters. Our empirical results indicate that C-SVCF is an effective method for identifying misclassified outliers. This approach benefits ongoing research on developing accurate and robust prediction models for breast cancer survivability.
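The filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the filter is trained and applied, so the sketch assumes a k-fold scheme in which each instance is judged by a classifier trained on the other folds. The names `classification_filter` and `NearestCentroid` are hypothetical, and the nearest-centroid model is only a dependency-free stand-in for the C-SVC classifier; any object with `fit` and `predict` methods (for example scikit-learn's `SVC`) could be supplied through `make_classifier` instead.

```python
from collections import defaultdict


class NearestCentroid:
    """Hypothetical stand-in classifier (NOT the paper's C-SVC)."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        # One mean vector per class label.
        self.centroids_ = {
            lab: [s / counts[lab] for s in sums[lab]] for lab in sums
        }
        return self

    def predict(self, X):
        def nearest(row):
            # Label whose centroid has the smallest squared distance.
            return min(
                self.centroids_,
                key=lambda lab: sum(
                    (a - b) ** 2 for a, b in zip(row, self.centroids_[lab])
                ),
            )

        return [nearest(row) for row in X]


def classification_filter(X, y, make_classifier, n_folds=5):
    """Return sorted indices of instances the filter classifies correctly.

    Assumption: k-fold filtering. Each instance is predicted by a model
    trained on the remaining folds and is kept only if the prediction
    matches its recorded label.
    """
    n = len(X)
    keep = []
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        if not test_idx or not train_idx:
            continue
        clf = make_classifier().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        preds = clf.predict([X[i] for i in test_idx])
        keep.extend(i for i, p in zip(test_idx, preds) if p == y[i])
    return sorted(keep)


# Toy data (made up for illustration): two well-separated clusters,
# with instance 4 lying in the first cluster but labeled as class 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
     [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

kept = classification_filter(X, y, NearestCentroid, n_folds=5)
X_clean = [X[i] for i in kept]
y_clean = [y[i] for i in kept]
print(kept)  # indices passed on to the downstream learning algorithm
```

In this toy run the mislabeled instance is the only one dropped, and `X_clean`, `y_clean` are what would be handed to the final prediction model. The number of folds and the choice of base classifier are free parameters of the sketch, not values taken from the paper.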
