Improving semi-supervised fuzzy c-means classification of Breast Cancer data using feature selection

In previous work, six clinically novel and useful subgroups of breast cancer were identified by using rules and clinicians' expertise to combine the solutions of three different clustering algorithms applied to a database of biomarkers. The motivation for the present work is to reproduce this classification using a single clustering method. In the long term, we hope to produce a clinically useful classification from fewer features (biomarkers), reducing the time and cost of running complex and expensive clinical tests. The aim of this paper is therefore to investigate feature selection in combination with semi-supervised fuzzy c-means (ssFCM) as a way to reduce the number of features while maintaining accuracy (defined as agreement with the previous classification), both on our breast cancer biomarker data and on other benchmark datasets. We present experimental results for four feature selection techniques, selecting 10, 15 and 17 of the original 25 breast cancer biomarkers. We varied the amount of labelled data (10% to 60% of the training data) and evaluated classification accuracy using cross-validation. Classification accuracy increased when 15 or 17 breast cancer biomarkers were used. With SVM-RFE and CFS, classification accuracy also improved on three UCI datasets: Arrhythmia, Cardiotocography and Yeast.
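The pipeline summarised above (select a subset of features, cluster with partial labels, score agreement with a reference classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the Fisher-style filter score is a simple stand-in and is not one of the four feature selection techniques evaluated in the paper (it is neither CFS nor SVM-RFE), and the clustering step is a simplified fuzzy c-means in which labelled points keep a fixed one-hot membership, not the exact ssFCM formulation used here. The toy data are invented, and the sketch assumes every class has at least one labelled point.

```python
import math

def fisher_scores(X, labels):
    """Between-class / within-class variance per feature, on labelled points only."""
    lab = [(x, y) for x, y in zip(X, labels) if y is not None]
    classes = sorted({y for _, y in lab})
    scores = []
    for j in range(len(X[0])):
        col = [x[j] for x, _ in lab]
        mean = sum(col) / len(col)
        between = within = 0.0
        for c in classes:
            vals = [x[j] for x, y in lab if y == c]
            mu = sum(vals) / len(vals)
            between += len(vals) * (mu - mean) ** 2
            within += sum((v - mu) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return scores

def top_k_features(X, labels, k):
    """Indices of the k highest-scoring features, in ascending index order."""
    scores = fisher_scores(X, labels)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])

def ssfcm_clamped(X, labels, n_clusters, m=2.0, n_iter=50):
    """Fuzzy c-means where labelled points keep a fixed one-hot membership.

    Simplified illustration only; cluster i is tied to class i by the labels.
    """
    n, d = len(X), len(X[0])
    # One-hot rows for labelled points, uniform rows for unlabelled points.
    U = [[(1.0 if labels[k] == i else 0.0) if labels[k] is not None
          else 1.0 / n_clusters for i in range(n_clusters)] for k in range(n)]
    V = []
    for _ in range(n_iter):
        # Centre update: mean of all points weighted by membership ** m.
        V = []
        for i in range(n_clusters):
            w = [U[k][i] ** m for k in range(n)]
            total = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(n)) / total
                      for j in range(d)])
        # Standard FCM membership update, applied to unlabelled points only.
        for k in range(n):
            if labels[k] is not None:
                continue
            dist = [max(math.dist(X[k], V[i]), 1e-12) for i in range(n_clusters)]
            inv = [dk ** (-2.0 / (m - 1.0)) for dk in dist]
            s = sum(inv)
            U[k] = [v / s for v in inv]
    return U, V

def agreement(U, reference):
    """Fraction of points whose highest-membership cluster matches the reference."""
    pred = [max(range(len(u)), key=lambda i: u[i]) for u in U]
    return sum(p == t for p, t in zip(pred, reference)) / len(reference)

# Invented toy data: features 0 and 2 separate the classes, feature 1 is noise.
X = [[0.0, 5.0, 0.1], [0.2, 1.0, 0.0], [0.1, 9.0, 0.2], [0.3, 3.0, 0.1],
     [1.0, 4.0, 1.1], [0.9, 8.0, 1.0], [1.1, 2.0, 0.9], [0.8, 6.0, 1.2]]
reference = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, None, None, 1, 1, None, None]  # half of the points are labelled

keep = top_k_features(X, labels, 2)
X_sel = [[x[j] for j in keep] for x in X]
U, V = ssfcm_clamped(X_sel, labels, n_clusters=2)
acc = agreement(U, reference)
print(keep, acc)
```

In a fuller experiment, the labelled fraction would be varied and the agreement averaged over cross-validation folds, with the feature ranking recomputed inside each training fold so that the held-out data do not influence the selection.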
