A comparative study on feature selection for a risk prediction model for colorectal cancer

BACKGROUND AND OBJECTIVE Risk prediction models aim to identify people at higher risk of developing a target disease. Feature selection is particularly important both for improving prediction model performance while avoiding overfitting and for identifying the leading cancer risk (and protective) factors. When the aim is to analyze the features with the most predictive power, assessing the stability of feature selection/ranking algorithms becomes an important issue.
METHODS This work focuses on colorectal cancer and assesses several feature ranking algorithms in terms of the performance of a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated using both a conventional approach based on scalar stability metrics and a visual approach proposed in this work, which examines the similarity among feature ranking techniques as well as their individual stability. The most relevant features found in this study are compared with the features provided by experts according to state-of-the-art knowledge.
RESULTS The two best results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and with Logistic Regression using the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Feature selection improved AUC by 3.9% and 1.9% for the SVM and Logistic Regression classifiers, respectively, relative to the results obtained with the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable, while Random Forest is the most stable.
CONCLUSIONS This study shows that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, whereas the SVM wrapper and the Pearson correlation coefficient are moderately stable while achieving good model performance.
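To make the notion of a scalar stability metric concrete, the following is a minimal illustrative sketch and not the procedure used in this study. It assumes a Pearson-correlation filter ranking, bootstrap resampling, and the mean pairwise Jaccard similarity of the top-k feature subsets as the stability score. The function names (`pearson_ranking`, `jaccard`, `topk_stability`) and the parameters `k`, `n_resamples` and `seed` are hypothetical choices made for illustration only.

```python
import numpy as np


def pearson_ranking(X, y):
    """Return feature indices ordered by |Pearson correlation| with y, best first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Zero-variance features (or labels) get a correlation of 0 instead of NaN.
    safe = np.where(denom > 0, denom, 1.0)
    corr = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / safe, 0.0)
    return np.argsort(-np.abs(corr), kind="stable")


def jaccard(a, b):
    """Jaccard similarity between two collections of feature indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def topk_stability(X, y, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard similarity of top-k subsets over bootstrap resamples.

    Assumption: this is one simple stability score among many; it is not
    claimed to be the metric reported in the abstract above.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        subsets.append(pearson_ranking(X[idx], y[idx])[:k])
    scores = [
        jaccard(subsets[i], subsets[j])
        for i in range(n_resamples)
        for j in range(i + 1, n_resamples)
    ]
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic data: only the first two features carry signal.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    print("top-2 stability:", topk_stability(X, y, k=2))
```

A score near 1 means the same features are selected across resamples, while a score near 0 means the selected subsets rarely overlap. Reporting such a score alongside AUC is one way to examine stability and model performance jointly, as the conclusions recommend.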
